Semantic Searching
Semantic search allows you to find code based on meaning rather than just keywords. Kit supports semantic code search using vector embeddings and ChromaDB, enabling you to search for code using natural language queries.
How it works
- Chunks your codebase (by symbols or lines)
- Embeds each chunk using your chosen model (OpenAI, HuggingFace, etc)
- Stores embeddings in a local ChromaDB vector database
- Lets you search for code using natural language or code-like queries
Quick Start
```python
from kit import Repository
from sentence_transformers import SentenceTransformer

# Use any embedding model you like
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_fn(texts):
    return model.encode(texts).tolist()

repo = Repository("/path/to/codebase")
vs = repo.get_vector_searcher(embed_fn=embed_fn)
vs.build_index()  # Index all code chunks (run once, or after code changes)

results = repo.search_semantic("How is authentication handled?", embed_fn=embed_fn)
for hit in results:
    print(hit["file"], hit.get("name"), hit.get("type"), hit.get("code"))

# Example output:
# src/kit/auth.py    login        function  def login(...): ...
# src/kit/config.py  AUTH_CONFIG  variable  AUTH_CONFIG = {...}
```
Configuration
Required: Embedding Function
You must provide an embedding function (`embed_fn`) when first accessing semantic search features via `repo.get_vector_searcher()` or `repo.search_semantic()`. This function takes a list of text strings and returns a list of corresponding embedding vectors.
```python
from kit import Repository

repo = Repository("/path/to/repo")

# Define the embedding function wrapper
def embed_fn(texts: list[str]) -> list[list[float]]:
    # Adapt this to your specific embedding library/API
    return get_embeddings(texts)

# Pass the function when searching
results = repo.search_semantic("database connection logic", embed_fn=embed_fn)

# Or when getting the searcher explicitly
vector_searcher = repo.get_vector_searcher(embed_fn=embed_fn)
```
Choosing an Embedding Model
`kit` is model-agnostic: pass any function `List[str] -> List[List[float]]`.
Local (Open-Source) Models
Use `sentence-transformers` models for fast, local inference:
```python
from sentence_transformers import SentenceTransformer

# Popular lightweight model (~100 MB download)
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_fn(texts: list[str]) -> list[list[float]]:
    return model.encode(texts).tolist()

# Or try larger, more accurate models
model = SentenceTransformer("all-mpnet-base-v2")  # ~420 MB, better quality
```
Cloud API Models
Use OpenAI or other cloud embedding services:
```python
import openai

def embed_fn(texts: list[str]) -> list[list[float]]:
    """OpenAI embedding function with batching support."""
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [data.embedding for data in response.data]

# Alternative: single-text fallback for simple APIs
def embed_fn_single(texts: list[str]) -> list[list[float]]:
    """If your API only supports single strings."""
    embeddings = []
    for text in texts:
        resp = openai.embeddings.create(model="text-embedding-3-small", input=[text])
        embeddings.append(resp.data[0].embedding)
    return embeddings
```
Batching Support
`VectorSearcher` will attempt to call your `embed_fn` with a list of texts for efficiency. If your function only supports single strings, it still works (it falls back internally).
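If you are adapting a model or API that only embeds one string at a time, a thin wrapper like the one below keeps the batched signature `kit` expects. This is a sketch; `embed_single` is a hypothetical stand-in for whatever single-text embedder your library provides.

```python
def make_batched_embed_fn(embed_single):
    """Wrap a single-text embedder (str -> list[float]) into the batched
    list[str] -> list[list[float]] signature that kit expects.

    `embed_single` is a placeholder for your own embedding call.
    """
    def embed_fn(texts: list[str]) -> list[list[float]]:
        return [embed_single(text) for text in texts]
    return embed_fn

# Usage sketch:
# embed_fn = make_batched_embed_fn(my_single_text_embedder)
# vs = repo.get_vector_searcher(embed_fn=embed_fn)
```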
Backend Configuration
`kit`’s `VectorSearcher` uses a pluggable backend system for storing and querying vector embeddings. Currently, the primary supported and default backend is ChromaDB.
ChromaDB (Default)
When you initialize `VectorSearcher` without specifying a `backend` argument, `kit` automatically uses an instance of `ChromaDBBackend`.
Configuration Options:
- `persist_dir` (Optional[str]): Specifies where the ChromaDB index will be stored on disk.
  - If you provide a path: `repo.get_vector_searcher(persist_dir="./my_index")`
  - If no `persist_dir` is specified, it defaults to `YOUR_REPO_PATH/.kit/vector_db/`
  - Persisting the index allows you to reuse it across sessions without re-indexing
```python
# Example: Initialize with a custom persist directory
vector_searcher = repo.get_vector_searcher(
    embed_fn=my_embedding_function,
    persist_dir="./my_custom_kit_vector_index",
)

# Build the index (first time, or to update it)
vector_searcher.build_index()

# Later, to reuse the persisted index:
vector_searcher_reloaded = repo.get_vector_searcher(
    embed_fn=my_embedding_function,
    persist_dir="./my_custom_kit_vector_index",
)
results = vector_searcher_reloaded.search("my query")
```
Other Backends
While the `VectorDBBackend` interface is designed to support other vector databases, ChromaDB is the primary focus for now. If you need other backends such as Faiss, please raise an issue on the kit GitHub repository.
Usage Patterns
Chunking Strategy
Control how your code is broken into searchable chunks:
```python
# Default: chunk by symbols (functions, classes, variables)
vs.build_index(chunk_by="symbols")

# Alternative: chunk by lines (~50-line blocks)
vs.build_index(chunk_by="lines")  # Useful for unsupported languages
```
chunk_by="symbols"
(default) extracts functions/classes/variables via the existing AST parser. This is usually what you want.
You can re-index at any time; the previous collection is cleared automatically.
Persisting & Re-using an Index
The index lives under `.kit/vector_db` by default (one Chroma collection per path).
```python
vs = repo.get_vector_searcher(embed_fn, persist_dir=".kit/my_index")
vs.build_index()

# … later …
searcher = repo.get_vector_searcher(embed_fn, persist_dir=".kit/my_index")
results = searcher.search("add user authentication")
```
Docstring Index
Prefer meaning-first search? Instead of embedding raw code, you can build an index of LLM-generated summaries:
`DocstringIndexer` → `SummarySearcher`
See Docstring-Based Vector Index for details.
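As a rough sketch of what that flow can look like (the constructor arguments and method names below are assumptions, not the confirmed API — the Docstring-Based Vector Index page is authoritative):

```python
from kit import Repository, DocstringIndexer, SummarySearcher  # names assumed; see the dedicated guide

repo = Repository("/path/to/repo")

# Summarize each symbol with an LLM, then embed the summaries (assumed API)
indexer = DocstringIndexer(repo, summarizer=repo.get_summarizer())
indexer.build()

# Search against the summaries instead of the raw code (assumed API)
searcher = SummarySearcher(indexer)
results = searcher.search("Where is user authentication handled?", top_k=5)
```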
Feeding Results to an LLM
Combine `VectorSearcher` with `ContextAssembler` to build an LLM prompt containing only relevant code:
```python
from kit import ContextAssembler

chunks = repo.search_semantic("jwt auth flow", embed_fn=embed_fn, top_k=10)
assembler = ContextAssembler(max_chars=12_000)
context = assembler.from_chunks(chunks)
llm_response = my_llm.chat(prompt + context)
```
Advanced Usage Examples
Multi-Query Search
```python
queries = [
    "database connection setup",
    "user authentication logic",
    "error handling patterns",
]

all_results = []
for query in queries:
    results = repo.search_semantic(query, embed_fn=embed_fn, top_k=5)
    all_results.extend(results)

# Deduplicate by file path
unique_files = {r["file"]: r for r in all_results}
```
Filtering Results
```python
# Search only in specific directories
results = repo.search_semantic("api endpoints", embed_fn=embed_fn)
api_results = [r for r in results if "src/api/" in r["file"]]

# Search only for functions
function_results = [r for r in results if r.get("type") == "function"]
```
Best Practices
Performance Tips
- Index size: Indexing a very large monorepo may take minutes. Consider running on CI and committing `.kit/vector_db`.
- Chunking: Use `chunk_by="symbols"` for better semantic boundaries.
- Model selection: Balance model size vs. quality based on your needs.
- Batch embedding: Use APIs that support batch embedding for better performance (see the batching sketch below).
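For instance, if your embedding API limits how many inputs a single request may contain, you can split texts into fixed-size batches inside `embed_fn`. This is a sketch; the batch size of 100 is an arbitrary assumption, so check your provider's actual limits.

```python
import openai

BATCH_SIZE = 100  # assumed limit; adjust to your provider's documented maximum

def embed_fn(texts: list[str]) -> list[list[float]]:
    """Embed texts in fixed-size batches to stay under per-request API limits."""
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i : i + BATCH_SIZE]
        response = openai.embeddings.create(model="text-embedding-3-small", input=batch)
        embeddings.extend(data.embedding for data in response.data)
    return embeddings
```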
Search Quality
- Clean code: Embeddings are language-agnostic – comments and docs influence similarity too, so clean code and accurate comments improve search.
- Query formulation: Use natural language descriptions of what you’re looking for
- Combine approaches: Exact-keyword search (`repo.search_text()`) can still be faster for quick look-ups; combine both techniques (see the hybrid sketch below).
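A minimal hybrid lookup might run both searches and merge the hits, assuming both result dictionaries carry a `"file"` key (the semantic results shown above do; the merge strategy here is just one option):

```python
def hybrid_search(repo, query, embed_fn, top_k=10):
    """Combine exact text matches with semantic hits, keeping text matches first."""
    text_hits = repo.search_text(query)                                    # fast, exact keyword matches
    semantic_hits = repo.search_semantic(query, embed_fn=embed_fn, top_k=top_k)

    # Merge, de-duplicating by file so each file appears only once
    merged, seen = [], set()
    for hit in list(text_hits) + list(semantic_hits):
        if hit["file"] not in seen:
            seen.add(hit["file"])
            merged.append(hit)
    return merged
```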
Production Considerations
```python
# Example: Production-ready setup with error handling
import logging

def safe_semantic_search(repo_path: str, query: str, top_k: int = 5):
    try:
        repo = Repository(repo_path)

        # Check if an index exists
        vector_searcher = repo.get_vector_searcher(embed_fn=embed_fn)

        # Build the index if needed (check whether the collection is empty)
        try:
            test_results = vector_searcher.search("test", top_k=1)
            if not test_results:
                logging.info("Building semantic index...")
                vector_searcher.build_index()
        except Exception:
            logging.info("Building semantic index...")
            vector_searcher.build_index()

        return repo.search_semantic(query, embed_fn=embed_fn, top_k=top_k)

    except Exception as e:
        logging.error(f"Semantic search failed: {e}")
        # Fall back to text search
        return repo.search_text(query)
```
Limitations & Future Plans
- CLI support: The CLI (`kit search` / `kit serve`) currently performs text search only. A semantic variant is planned.
- Language support: Works with any language that kit can parse, but quality depends on symbol extraction.
- Index management: Future versions may include index cleanup, optimization, and migration tools.