Semantic Searching
Semantic search lets you find code by meaning rather than by keywords alone. Kit implements it with vector embeddings and ChromaDB (both local and cloud), so you can query your codebase in natural language.
How it works
- Chunks your codebase (by symbols or lines)
- Embeds each chunk using your chosen model (OpenAI, HuggingFace, etc)
- Stores embeddings in a local ChromaDB vector database
- Lets you search for code using natural language or code-like queries
Quick Start
```python
from kit import Repository
from sentence_transformers import SentenceTransformer

# Use any embedding model you like
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_fn(texts):
    return model.encode(texts).tolist()

repo = Repository("/path/to/codebase")
vs = repo.get_vector_searcher(embed_fn=embed_fn)
vs.build_index()  # Index all code chunks (run once, or after code changes)

results = repo.search_semantic("How is authentication handled?", embed_fn=embed_fn)
for hit in results:
    print(hit["file"], hit.get("name"), hit.get("type"), hit.get("code"))

# Example output:
# src/kit/auth.py    login        function  def login(...): ...
# src/kit/config.py  AUTH_CONFIG  variable  AUTH_CONFIG = {...}
```
Configuration
Required: Embedding Function
You must provide an embedding function (`embed_fn`) when first accessing semantic search features via `repo.get_vector_searcher()` or `repo.search_semantic()`. This function takes a list of text strings and returns a list of corresponding embedding vectors.
```python
from kit import Repository

repo = Repository("/path/to/repo")

# Define the embedding function wrapper
def embed_fn(texts: list[str]) -> list[list[float]]:
    # Adapt this to your specific embedding library/API
    return get_embeddings(texts)

# Pass the function when searching
results = repo.search_semantic("database connection logic", embed_fn=embed_fn)

# Or when getting the searcher explicitly
vector_searcher = repo.get_vector_searcher(embed_fn=embed_fn)
```
Choosing an Embedding Model
`kit` is model-agnostic: pass any function `List[str] -> List[List[float]]`.
Local (Open-Source) Models
Use `sentence-transformers` models for fast, local inference:
```python
from sentence_transformers import SentenceTransformer

# Popular lightweight model (~100 MB download)
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_fn(texts: list[str]) -> list[list[float]]:
    return model.encode(texts).tolist()

# Or try larger, more accurate models
model = SentenceTransformer("all-mpnet-base-v2")  # ~420 MB, better quality
```
Cloud API Models
Use OpenAI or other cloud embedding services:
```python
import openai

def embed_fn(texts: list[str]) -> list[list[float]]:
    """OpenAI embedding function with batching support."""
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [data.embedding for data in response.data]

# Alternative: single-text fallback for simple APIs
def embed_fn_single(texts: list[str]) -> list[list[float]]:
    """If your API only supports single strings."""
    embeddings = []
    for text in texts:
        resp = openai.embeddings.create(model="text-embedding-3-small", input=[text])
        embeddings.append(resp.data[0].embedding)
    return embeddings
```
Batching Support
`VectorSearcher` will attempt to call your `embed_fn` with a list of texts for efficiency. If your function only supports single strings, it still works (kit falls back internally).
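For example, if the API you are wrapping only accepts one string per call, you can still expose the batched signature yourself. A minimal sketch, where `embed_one` is a hypothetical stand-in for your single-text API (here backed by a local model so the snippet runs):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_one(text: str) -> list[float]:
    # Stand-in for a single-text-only embedding API
    return model.encode(text).tolist()

def embed_fn(texts: list[str]) -> list[list[float]]:
    # Tolerate a bare string defensively, then embed one text at a time
    if isinstance(texts, str):
        texts = [texts]
    return [embed_one(t) for t in texts]
```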
Backend Configuration
`kit`'s `VectorSearcher` uses a pluggable backend system for storing and querying vector embeddings. Kit supports both local ChromaDB storage and Chroma Cloud for managed vector databases.
Local ChromaDB (Default)
When you initialize `VectorSearcher` without specifying a `backend` argument, `kit` automatically uses an instance of `ChromaDBBackend` for local storage.
Configuration Options:
- `persist_dir` (`Optional[str]`): Specifies where the ChromaDB index is stored on disk.
  - If you provide a path: `repo.get_vector_searcher(persist_dir="./my_index")`
  - If no `persist_dir` is specified, it defaults to `YOUR_REPO_PATH/.kit/vector_db/`
  - Persisting the index lets you reuse it across sessions without re-indexing
```python
# Example: Initialize with custom persist directory
vector_searcher = repo.get_vector_searcher(
    embed_fn=my_embedding_function,
    persist_dir="./my_custom_kit_vector_index",
)

# Building the index (first time or to update)
vector_searcher.build_index()

# Later, to reuse the persisted index:
vector_searcher_reloaded = repo.get_vector_searcher(
    embed_fn=my_embedding_function,
    persist_dir="./my_custom_kit_vector_index",
)
results = vector_searcher_reloaded.search("my query")
```
Chroma Cloud (Managed Service)
Kit supports Chroma Cloud, a fully managed vector database service, as an alternative to local ChromaDB storage. This improves scalability and team collaboration, and removes local storage constraints for large codebases.
Important: The API remains exactly the same whether you're using local ChromaDB or Chroma Cloud. The only difference is which Chroma server you connect to: a local single-node instance or the managed cloud service. Your code doesn't need to change; only the environment configuration does.
Prerequisites
- ChromaDB version 1.0.0 or higher (`pip install "chromadb>=1.0.0"`)
- Python 3.10+
- A Chroma Cloud account (sign up at https://trychroma.com/signup)
Configuration
Set the following environment variables to enable Chroma Cloud:
```bash
# Required: Enable Chroma Cloud backend
export KIT_USE_CHROMA_CLOUD="true"

# Required: Your Chroma Cloud API key
export CHROMA_API_KEY="your-api-key-here"

# Required: Your tenant UUID (must be valid UUID format)
export CHROMA_TENANT="3893b771-b971-4f45-8e30-7aac7837ad7f"

# Required: Your database name (create it in the dashboard first)
export CHROMA_DATABASE="kit-codebase-index"
```
Note: The tenant must be a valid UUID. Kit will validate the format and provide a clear error if an invalid tenant is provided. Find your tenant UUID in your Chroma Cloud dashboard.
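If you want to catch a malformed tenant ID before kit does, a quick pre-flight check with the standard library (not part of kit's API) looks like this:

```python
import os
import uuid

tenant = os.environ.get("CHROMA_TENANT", "")
try:
    uuid.UUID(tenant)  # Raises ValueError for anything that isn't a valid UUID
except ValueError:
    raise SystemExit(f"CHROMA_TENANT is not a valid UUID: {tenant!r}")
```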
Usage
Once the environment variables are set, Kit will automatically use Chroma Cloud:
```python
from kit import Repository
from kit.vector_searcher import VectorSearcher

# Initialize repository
repo = Repository("/path/to/repo")

# Create vector searcher; it will use Chroma Cloud automatically
searcher = VectorSearcher(repo, embed_fn=my_embed_function)

# Build index (stored in the cloud)
searcher.build_index()

# Search (queries the cloud backend)
results = searcher.search("find authentication logic", top_k=5)
```
Programmatic Configuration
You can also explicitly create a cloud backend:
```python
from kit.vector_searcher import ChromaCloudBackend, VectorSearcher

# Create cloud backend explicitly
backend = ChromaCloudBackend(
    api_key="your-api-key",
    tenant="3893b771-b971-4f45-8e30-7aac7837ad7f",  # UUID format
    database="kit-codebase-index",
    collection_name="my_project_index",
)

# Use with VectorSearcher
searcher = VectorSearcher(repo, embed_fn=my_embed_function, backend=backend)
```
Switching Between Local and Cloud
Kit determines which backend to use based on the `KIT_USE_CHROMA_CLOUD` environment variable:

- `KIT_USE_CHROMA_CLOUD=true`: uses Chroma Cloud (requires `CHROMA_API_KEY`)
- `KIT_USE_CHROMA_CLOUD=false` or unset: uses local ChromaDB storage
```bash
# Switch to cloud
export KIT_USE_CHROMA_CLOUD="true"

# Switch back to local
export KIT_USE_CHROMA_CLOUD="false"
# or
unset KIT_USE_CHROMA_CLOUD
```
Benefits of Chroma Cloud
- Team Collaboration: Share indexes across team members
- No Storage Limits: Scale to massive codebases without local disk constraints
- Managed Service: No infrastructure to maintain
- Persistence: Data persists across machines and sessions
- Performance: Optimized cloud infrastructure for vector operations
Migration
To migrate from local to cloud:
- Set up Chroma Cloud credentials
- Rebuild your indexes (they'll be stored in the cloud)
To migrate from cloud to local:
- Unset `KIT_USE_CHROMA_CLOUD`
- Rebuild your indexes locally
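Either way, the rebuild itself looks the same; only the active backend differs. A minimal sketch of the local-to-cloud direction, assuming the environment variables from the Configuration section and the `embed_fn` defined earlier:

```python
import os

from kit import Repository

# Normally set in your shell or CI; shown inline here for illustration
os.environ["KIT_USE_CHROMA_CLOUD"] = "true"

repo = Repository("/path/to/repo")
searcher = repo.get_vector_searcher(embed_fn=embed_fn)
searcher.build_index()  # Re-embeds everything and stores the index in Chroma Cloud
```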
Other Backends
While the `VectorDBBackend` interface is designed to support other vector databases, ChromaDB (local and cloud) is the primary focus for now. If you need another backend, such as Faiss, please raise an issue on the kit GitHub repository.
Usage Patterns
Chunking Strategy
Control how your code is broken into searchable chunks:
```python
# Default: chunk by symbols (functions, classes, variables)
vs.build_index(chunk_by="symbols")

# Alternative: chunk by lines (~50-line blocks)
vs.build_index(chunk_by="lines")  # Useful for unsupported languages
```
`chunk_by="symbols"` (default) extracts functions/classes/variables via the existing AST parser. This is usually what you want.
You can re-index at any time; the previous collection is cleared automatically.
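A refresh after the codebase changes is therefore just another `build_index()` call on the same searcher:

```python
# Rebuild in place; the previous collection is replaced automatically
vs = repo.get_vector_searcher(embed_fn=embed_fn)
vs.build_index(chunk_by="symbols")
```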
Persisting & Re-using an Index
The index lives under `.kit/vector_db` by default (one Chroma collection per path).
```python
vs = repo.get_vector_searcher(embed_fn, persist_dir=".kit/my_index")
vs.build_index()

# … later …
searcher = repo.get_vector_searcher(embed_fn, persist_dir=".kit/my_index")
results = searcher.search("add user authentication")
```
Docstring Index
Prefer meaning-first search? Instead of embedding raw code you can build an index of LLM-generated summaries:
`DocstringIndexer` → `SummarySearcher`
See Docstring-Based Vector Index for details.
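A rough sketch of that flow (illustrative only; the import paths and constructor arguments here are assumptions, so check the linked page for the exact API):

```python
from kit import Repository
from kit.docstring_indexer import DocstringIndexer  # assumed import path
from kit.summaries import SummarySearcher           # assumed import path

repo = Repository("/path/to/repo")

# Summarize each symbol with an LLM, then embed the summaries (assumed API)
indexer = DocstringIndexer(repo, summarizer=repo.get_summarizer())
indexer.build()

# Query the summaries instead of the raw code
searcher = SummarySearcher(indexer)
hits = searcher.search("Where do we retry failed HTTP requests?", top_k=5)
```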
Feeding Results to an LLM
Combine `VectorSearcher` with `ContextAssembler` to build an LLM prompt containing only relevant code:
```python
from kit import ContextAssembler

chunks = repo.search_semantic("jwt auth flow", embed_fn=embed_fn, top_k=10)
assembler = ContextAssembler(max_chars=12_000)
context = assembler.from_chunks(chunks)
llm_response = my_llm.chat(prompt + context)
```
Advanced Usage Examples
Multi-Query Search
```python
queries = [
    "database connection setup",
    "user authentication logic",
    "error handling patterns",
]

all_results = []
for query in queries:
    results = repo.search_semantic(query, embed_fn=embed_fn, top_k=5)
    all_results.extend(results)

# Deduplicate by file path
unique_files = {r["file"]: r for r in all_results}
```
Filtering Results
```python
# Search only in specific directories
results = repo.search_semantic("api endpoints", embed_fn=embed_fn)
api_results = [r for r in results if "src/api/" in r["file"]]

# Search only for functions
function_results = [r for r in results if r.get("type") == "function"]
```
Best Practices
Performance Tips
- Index size: Indexing a very large monorepo may take minutes. Consider running indexing on CI and committing `.kit/vector_db`.
- Chunking: Use `chunk_by="symbols"` for better semantic boundaries
- Model selection: Balance model size vs. quality based on your needs
- Batch embedding: Use APIs that support batch embedding for better performance, as shown below
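On the last point: many embedding APIs cap the number of inputs per request, so a batched `embed_fn` that chunks its input is a common pattern. A sketch using the OpenAI client from earlier (the 100-text cap is a placeholder; check your provider's limits):

```python
import openai

BATCH_SIZE = 100  # placeholder per-request cap

def embed_fn(texts: list[str]) -> list[list[float]]:
    """Embed texts in batches to stay under API request limits."""
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i : i + BATCH_SIZE]
        response = openai.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend(d.embedding for d in response.data)
    return embeddings
```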
Search Quality
- Clean code: Embeddings are language-agnostic, and comments and docstrings influence similarity too, so clean, well-commented code improves search quality
- Query formulation: Use natural language descriptions of what you’re looking for
- Combine approaches: Exact-keyword search (`repo.search_text()`) can still be faster for quick look-ups; combine both techniques, as in the sketch below
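One way to combine them is to run the cheap keyword search first and merge in semantic results. A sketch, assuming `repo` and `embed_fn` from earlier and that both result types carry a `"file"` key, as in the examples above:

```python
def hybrid_search(repo, query: str, embed_fn, top_k: int = 10):
    # Exact keyword hits are fast and precise when the term appears verbatim
    keyword_hits = repo.search_text(query)

    # Semantic hits catch paraphrases the keyword search misses
    semantic_hits = repo.search_semantic(query, embed_fn=embed_fn, top_k=top_k)

    # Merge, keyword hits first, deduplicating by file path
    seen, merged = set(), []
    for hit in list(keyword_hits) + list(semantic_hits):
        if hit["file"] not in seen:
            seen.add(hit["file"])
            merged.append(hit)
    return merged[:top_k]
```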
Production Considerations
```python
# Example: Production-ready setup with error handling
import logging

from kit import Repository

def safe_semantic_search(repo_path: str, query: str, top_k: int = 5):
    try:
        repo = Repository(repo_path)

        # Check if index exists
        vector_searcher = repo.get_vector_searcher(embed_fn=embed_fn)

        # Build index if needed (check if collection is empty)
        try:
            test_results = vector_searcher.search("test", top_k=1)
            if not test_results:
                logging.info("Building semantic index...")
                vector_searcher.build_index()
        except Exception:
            logging.info("Building semantic index...")
            vector_searcher.build_index()

        return repo.search_semantic(query, embed_fn=embed_fn, top_k=top_k)

    except Exception as e:
        logging.error(f"Semantic search failed: {e}")
        # Fallback to text search
        return repo.search_text(query)
```
CLI Support
Semantic search is now available via the `kit search-semantic` command:
```bash
# Basic semantic search
kit search-semantic /path/to/repo "authentication logic"

# Advanced options
kit search-semantic /path/to/repo "error handling patterns" \
  --top-k 10 \
  --embedding-model all-mpnet-base-v2 \
  --chunk-by symbols
```
See the CLI documentation for complete usage details.
Limitations & Future Plans
- Language support: Works with any language that kit can parse, but quality depends on symbol extraction
- Index management: Future versions may include index cleanup, optimization, and migration tools