pinecone is a managed vector database. use this if you already have data in pinecone or want managed vector search without running your own infrastructure. castform does not provide pinecone access; you’ll need your own api key.
## when to use pinecone
- you already have data indexed in pinecone
- you want managed vector search with no servers to run
- you want to use pinecone’s hosted inference for embeddings (no need to provide your own embedding function)
## 1. create your corpus

use `PineconeChunkSource` to upload and index your documents:

```python
from trainer.corpus.pinecone.source import PineconeChunkSource

source = PineconeChunkSource(
    api_key="pc_...",
    index_name="my-docs",
)

# upload from a local folder
source.populate_from_folder("./docs/")

# or upload pre-chunked data
source.populate_from_chunks(chunks)
```
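the exact chunk schema is defined by castform, but as an illustration only, pre-chunked data could look like a list of dicts carrying content and a file path. the `chunk_text` helper and the dict keys below are hypothetical, not the library's API:

```python
# hypothetical chunk shape for illustration; the real schema is
# whatever PineconeChunkSource.populate_from_chunks expects
def chunk_text(text, path, size=500):
    """Split one document into fixed-size character chunks."""
    return [
        {"content": text[i:i + size], "file_path": path}
        for i in range(0, len(text), size)
    ]

chunks = chunk_text("some long document text ...", "docs/guide.md", size=10)
```

real chunking would usually split on sentence or token boundaries; fixed-size character windows keep the sketch short.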
### embeddings

by default, pinecone uses its hosted inference API (`multilingual-e5-large`) for embeddings. to use a custom embedding function:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
source = PineconeChunkSource(
    api_key="pc_...",
    index_name="my-docs",
    embed_fn=model.encode,
)
```
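any callable with the same shape as `model.encode` should work: it takes a batch of strings and returns one fixed-length vector per string. a minimal deterministic stand-in (hash-based, illustration only, not semantically meaningful):

```python
import hashlib

def toy_embed(texts, dim=8):
    """Stand-in for an embedding function: one fixed-length
    vector per input string, derived from a hash digest."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode()).digest()
        vectors.append([b / 255.0 for b in digest[:dim]])
    return vectors

# same call shape as SentenceTransformer.encode: list[str] -> list of vectors
vecs = toy_embed(["hello", "world"])
```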
### bring your own index (BYOI)

if you already have data in pinecone with custom metadata field names, use `field_mapping` to map them to castform’s expected fields:

```python
source = PineconeChunkSource(
    api_key="pc_...",
    index_name="existing-index",
    field_mapping={
        "content": "body_text",       # your field name for chunk content
        "file_path": "source_file",   # your field name for file path
    },
)
```
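conceptually, `field_mapping` tells the source which of your metadata keys castform’s fields should be read from. a sketch of the remapping it implies (the `remap_metadata` helper is hypothetical, not part of the library):

```python
def remap_metadata(metadata, field_mapping):
    """Read castform's expected fields out of custom metadata keys.
    field_mapping maps expected field name -> your field name."""
    return {
        expected: metadata[custom]
        for expected, custom in field_mapping.items()
        if custom in metadata
    }

record = {"body_text": "chunk text here", "source_file": "docs/a.md"}
mapping = {"content": "body_text", "file_path": "source_file"}
remapped = remap_metadata(record, mapping)
# remapped == {"content": "chunk text here", "file_path": "docs/a.md"}
```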
### PineconeChunkSource parameters

| parameter | default | description |
|---|---|---|
| `api_key` | required | Pinecone API key |
| `index_name` | required | name of the Pinecone index |
| `index_host` | `None` | direct host URL (skips index lookup) |
| `namespace` | `""` | Pinecone namespace to use |
| `embed_fn` | `None` | custom embedding function; overrides hosted inference |
| `embed_model` | `"multilingual-e5-large"` | Pinecone hosted inference model (used when no `embed_fn`) |
| `field_mapping` | `None` | maps custom metadata field names for BYOI |
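the `index_host` shortcut can be pictured as: connect directly when a host URL is supplied, otherwise resolve the host from the index name with one extra control-plane call. a sketch, not the SDK’s actual code:

```python
def resolve_host(index_name, index_host=None, describe_index=None):
    """Pick the connection host: an explicit index_host wins;
    otherwise look the host up by index name (one extra API call)."""
    if index_host is not None:
        return index_host
    return describe_index(index_name)

# hypothetical lookup table standing in for the control-plane call
hosts = {"my-docs": "my-docs-abc123.svc.pinecone.io"}
direct = resolve_host("my-docs", index_host="my-docs-abc123.svc.pinecone.io")
looked_up = resolve_host("my-docs", describe_index=hosts.get)
```

passing `index_host` saves that lookup round-trip on every client construction.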
## 2. generate QA data

pass your `PineconeChunkSource` to the QA generation pipeline as usual; it implements the standard `ChunkSource` interface:

```python
from trainer.qa_generation.cgft_models import (
    CgftPipelineConfig, PlatformConfig, CorpusConfig, TargetsConfig,
)
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./docs/"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

pipeline = CgftPipeline(cfg, corpus_source=source)
result = pipeline.run()
```
## 3. train with SearchEnv

create a `PineconeSearch` client and pass it to `SearchEnv`:

```python
from trainer.corpus.pinecone.search import PineconeSearch
from trainer.envs.search_env import SearchEnv
from trainer.trainer.pipeline import train

search = PineconeSearch(
    api_key="pc_...",
    index_name="my-docs",
    # uses Pinecone hosted inference by default;
    # or pass embed_fn for custom embeddings
)

experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=result["train_dataset"],
    eval_dataset=result["eval_dataset"],
    prefix="pinecone-search",
    api_key="sk_...",
)
```
### PineconeSearch parameters

| parameter | default | description |
|---|---|---|
| `api_key` | required | Pinecone API key |
| `index_name` | required | name of the Pinecone index |
| `index_host` | `None` | direct host URL (skips index lookup) |
| `namespace` | `""` | Pinecone namespace |
| `embed_fn` | `None` | custom embedding function |
| `embed_model` | `"multilingual-e5-large"` | Pinecone hosted inference model (used when no `embed_fn`) |
| `field_mapping` | `None` | maps custom metadata field names for BYOI (bring your own index) |
## notes

- vector-only search: pinecone supports vector search only. if you need lexical or hybrid modes, consider turbopuffer or chroma
- hosted inference: when no `embed_fn` is provided, pinecone uses its own inference API with `embed_model` (default: `multilingual-e5-large`). this means you don’t need to run or manage an embedding model
- `PineconeSearch` is pickle-safe: it stores only connection parameters and reconstructs the SDK client lazily after unpickling
- `PineconeChunkSource` and `PineconeSearch` are separate: the source handles corpus creation and QA generation, the search client handles training
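the pickle-safety note can be sketched with a toy class that stores only its parameters and rebuilds its client on demand. this is an illustration of the pattern, not the actual `PineconeSearch` implementation:

```python
import pickle

class LazyClient:
    """Store only connection parameters; build the (unpicklable)
    SDK client lazily, and drop it when pickling."""
    def __init__(self, api_key, index_name):
        self.api_key = api_key
        self.index_name = index_name
        self._client = None  # created on first use

    @property
    def client(self):
        if self._client is None:
            # stand-in for constructing the real SDK client
            self._client = {"key": self.api_key, "index": self.index_name}
        return self._client

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # never pickle the live client
        return state

search = LazyClient("pc_...", "my-docs")
_ = search.client  # client is built on first access
restored = pickle.loads(pickle.dumps(search))  # survives the round-trip
```

dropping the live client in `__getstate__` is what lets instances cross process boundaries (e.g. into training workers) without the SDK’s network handles coming along.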