pinecone is a managed vector database. use this if you already have data in pinecone or want managed vector search without running your own infrastructure. castform does not provide pinecone access; you’ll need your own api key.
## when to use pinecone
- you already have data indexed in pinecone
- you want managed vector search with no servers to run
- you want to use pinecone’s hosted inference for embeddings (no need to provide your own embedding function)
## 1. create your corpus

use `PineconeChunkSource` to upload and index your documents:

```python
from trainer.corpus.pinecone.source import PineconeChunkSource

source = PineconeChunkSource(
    api_key="pc_...",
    index_name="my-docs",
)

# upload from a local folder
source.populate_from_folder("./docs/")

# or upload pre-chunked data
source.populate_from_chunks(chunks)
```
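the exact chunk schema is defined by castform, but as an illustration only, pre-chunked data could look like a list of dicts carrying content and a file path. the `chunk_text` helper and the dict keys below are hypothetical, not the library's API:

```python
# hypothetical chunk shape for illustration; the real schema is
# whatever PineconeChunkSource.populate_from_chunks expects
def chunk_text(text, path, size=500):
    """Split one document into fixed-size character chunks."""
    return [
        {"content": text[i:i + size], "file_path": path}
        for i in range(0, len(text), size)
    ]

chunks = chunk_text("some long document text ...", "docs/guide.md", size=10)
```

real chunking would usually split on sentence or token boundaries; fixed-size character windows keep the sketch short.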
### embeddings

by default, pinecone uses its hosted inference API (`multilingual-e5-large`) for embeddings. to use a custom embedding function:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
source = PineconeChunkSource(
    api_key="pc_...",
    index_name="my-docs",
    embed_fn=model.encode,
)
```
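any callable with the same shape as `model.encode` should work: it takes a batch of strings and returns one fixed-length vector per string. a minimal deterministic stand-in (hash-based, illustration only, not semantically meaningful):

```python
import hashlib

def toy_embed(texts, dim=8):
    """Stand-in for an embedding function: one fixed-length
    vector per input string, derived from a hash digest."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode()).digest()
        vectors.append([b / 255.0 for b in digest[:dim]])
    return vectors

# same call shape as SentenceTransformer.encode: list[str] -> list of vectors
vecs = toy_embed(["hello", "world"])
```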
### bring your own index (BYOI)

if you already have data in pinecone with custom metadata field names, use `field_mapping` to map them to castform’s expected fields:

```python
source = PineconeChunkSource(
    api_key="pc_...",
    index_name="existing-index",
    field_mapping={
        "content": "body_text",       # your field name for chunk content
        "file_path": "source_file",   # your field name for file path
    },
)
```
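conceptually, `field_mapping` tells the source which of your metadata keys castform’s fields should be read from. a sketch of the remapping it implies (the `remap_metadata` helper is hypothetical, not part of the library):

```python
def remap_metadata(metadata, field_mapping):
    """Read castform's expected fields out of custom metadata keys.
    field_mapping maps expected field name -> your field name."""
    return {
        expected: metadata[custom]
        for expected, custom in field_mapping.items()
        if custom in metadata
    }

record = {"body_text": "chunk text here", "source_file": "docs/a.md"}
mapping = {"content": "body_text", "file_path": "source_file"}
remapped = remap_metadata(record, mapping)
# remapped == {"content": "chunk text here", "file_path": "docs/a.md"}
```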
### PineconeChunkSource parameters

| parameter | default | description |
|---|---|---|
| `api_key` | required | Pinecone API key |
| `index_name` | required | name of the Pinecone index |
| `index_host` | `None` | direct host URL (skips index lookup) |
| `namespace` | `""` | Pinecone namespace to use |
| `embed_fn` | `None` | custom embedding function; overrides hosted inference |
| `embed_model` | `"multilingual-e5-large"` | Pinecone hosted inference model (used when no `embed_fn`) |
| `field_mapping` | `None` | maps custom metadata field names for BYOI |
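the `index_host` shortcut can be pictured as: connect directly when a host URL is supplied, otherwise resolve the host from the index name with one extra control-plane call. a sketch, not the SDK’s actual code:

```python
def resolve_host(index_name, index_host=None, describe_index=None):
    """Pick the connection host: an explicit index_host wins;
    otherwise look the host up by index name (one extra API call)."""
    if index_host is not None:
        return index_host
    return describe_index(index_name)

# hypothetical lookup table standing in for the control-plane call
hosts = {"my-docs": "my-docs-abc123.svc.pinecone.io"}
direct = resolve_host("my-docs", index_host="my-docs-abc123.svc.pinecone.io")
looked_up = resolve_host("my-docs", describe_index=hosts.get)
```

passing `index_host` saves that lookup round-trip on every client construction.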
## 2. generate QA data

pass your `PineconeChunkSource` to the QA generation pipeline as usual; it implements the standard `ChunkSource` interface:

```python
from trainer.qa_generation.cgft_models import (
    CgftPipelineConfig, PlatformConfig, CorpusConfig, TargetsConfig,
)
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./docs/"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

pipeline = CgftPipeline(cfg, corpus_source=source)
result = pipeline.run()
```
## 3. train with SearchEnv

create a `PineconeSearch` client and pass it to `SearchEnv`:

```python
from trainer.corpus.pinecone.search import PineconeSearch
from trainer.envs.search_env import SearchEnv
from trainer.trainer.pipeline import train

search = PineconeSearch(
    api_key="pc_...",
    index_name="my-docs",
    # uses Pinecone hosted inference by default;
    # or pass embed_fn for custom embeddings
)

experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=result["train_dataset"],
    eval_dataset=result["eval_dataset"],
    prefix="pinecone-search",
    api_key="sk_...",
)
```
### PineconeSearch parameters

| parameter | default | description |
|---|---|---|
| `api_key` | required | Pinecone API key |
| `index_name` | required | name of the Pinecone index |
| `index_host` | `None` | direct host URL (skips index lookup) |
| `namespace` | `""` | Pinecone namespace |
| `embed_fn` | `None` | custom embedding function |
| `embed_model` | `"multilingual-e5-large"` | Pinecone hosted inference model (used when no `embed_fn`) |
| `field_mapping` | `None` | maps custom metadata field names for BYOI (bring your own index) |
## notes

- vector-only search: pinecone supports vector search only. if you need lexical or hybrid modes, consider turbopuffer or chroma
- hosted inference: when no `embed_fn` is provided, pinecone uses its own inference API with `embed_model` (default: `multilingual-e5-large`). this means you don’t need to run or manage an embedding model
- `PineconeSearch` is pickle-safe: it stores only connection parameters and reconstructs the SDK client lazily after unpickling
- `PineconeChunkSource` and `PineconeSearch` are separate: the source handles corpus creation and QA generation, the search client handles training
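the pickle-safety note can be sketched with a toy class that stores only its parameters and rebuilds its client on demand. this is an illustration of the pattern, not the actual `PineconeSearch` implementation:

```python
import pickle

class LazyClient:
    """Store only connection parameters; build the (unpicklable)
    SDK client lazily, and drop it when pickling."""
    def __init__(self, api_key, index_name):
        self.api_key = api_key
        self.index_name = index_name
        self._client = None  # created on first use

    @property
    def client(self):
        if self._client is None:
            # stand-in for constructing the real SDK client
            self._client = {"key": self.api_key, "index": self.index_name}
        return self._client

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # never pickle the live client
        return state

search = LazyClient("pc_...", "my-docs")
_ = search.client  # client is built on first access
restored = pickle.loads(pickle.dumps(search))  # survives the round-trip
```

dropping the live client in `__getstate__` is what lets instances cross process boundaries (e.g. into training workers) without the SDK’s network handles coming along.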