# Semantic Searching
⚠ Experimental: Vector / semantic search is an early feature. APIs, CLI commands, and index formats may change in future releases without notice.
kit supports semantic code search using vector embeddings and ChromaDB.
## How it works

- Chunks your codebase (by symbols or lines)
- Embeds each chunk using your chosen model (OpenAI, HuggingFace, etc.)
- Stores embeddings in a local ChromaDB vector database
- Lets you search for code using natural language or code-like queries
## Example Usage

```python
from kit import Repository
from sentence_transformers import SentenceTransformer

# Use any embedding model you like
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_fn(text):
    return model.encode([text])[0].tolist()

repo = Repository("/path/to/codebase")
vs = repo.get_vector_searcher(embed_fn=embed_fn)
vs.build_index()  # Index all code chunks (run once, or after code changes)

results = repo.search_semantic("How is authentication handled?", embed_fn=embed_fn)
for hit in results:
    print(hit["file"], hit.get("name"), hit.get("type"), hit.get("code"))

# Example output:
# src/kit/auth.py    login        function  def login(...): ...
# src/kit/config.py  AUTH_CONFIG  variable  AUTH_CONFIG = {...}
```
## Choosing an Embedding Model

`kit` is model-agnostic: pass any function `str -> List[float]`.

- Local (open-source) – e.g. `sentence-transformers` models like `all-MiniLM-L6-v2` (see example above). 100 MB-ish download, fast CPU inference.
- Cloud API – e.g. OpenAI `text-embedding-3-small`. Define:

  ```python
  import openai

  def embed_fn(text: str):
      resp = openai.embeddings.create(model="text-embedding-3-small", input=[text])
      return resp.data[0].embedding
  ```

- Batching – `VectorSearcher` will attempt to call your `embed_fn` with a list of texts for efficiency. If your function only supports single strings it still works (falls back internally); see the batch-aware sketch after this list.
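If you want to benefit from that batched call path, one option is an `embed_fn` that accepts either a single string or a list of strings. A minimal sketch, re-using the `sentence-transformers` model from above (the dual-mode handling is our own convention, not something kit requires):

```python
from typing import List, Union

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_fn(texts: Union[str, List[str]]):
    """Embed one string or a batch of strings in a single model call."""
    single = isinstance(texts, str)
    vectors = model.encode([texts] if single else texts)
    vectors = [v.tolist() for v in vectors]
    return vectors[0] if single else vectors
```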
## Chunking Strategy

- `vs.build_index(chunk_by="symbols")` (default) extracts functions/classes/variables via the existing AST parser. This is usually what you want.
- `chunk_by="lines"` splits code into ~50-line blocks – useful for languages where symbol extraction isn’t supported yet (see the snippet below).
- You can re-index at any time; the previous collection is cleared automatically.
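For example, switching to line-based chunks is just another call to `build_index` on the `vs` object from the example above:

```python
# Re-index with ~50-line blocks instead of symbol-level chunks
vs.build_index(chunk_by="lines")
```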
Persisting & Re-using an Index
Section titled “Persisting & Re-using an Index”The index lives under .kit/vector_db
by default (one Chroma collection per path).
vs = repo.get_vector_searcher(embed_fn, persist_dir=".kit/my_index")vs.build_index()# … later …searcher = repo.get_vector_searcher(embed_fn, persist_dir=".kit/my_index")results = searcher.search("add user authentication")
## Docstring index
Section titled “Docstring index”Prefer meaning-first search? Instead of embedding raw code you can build an index of LLM-generated summaries:
`DocstringIndexer` → `SummarySearcher`
See Docstring-Based Vector Index for details.
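A rough sketch of that flow, assuming the `DocstringIndexer` / `SummarySearcher` API as described on the Docstring-Based Vector Index page (constructor arguments and method names may differ in your kit version):

```python
from kit import Repository, DocstringIndexer

repo = Repository("/path/to/codebase")

# Summarize each symbol with an LLM, then embed the summaries
# (API assumed from the docstring-index docs)
indexer = DocstringIndexer(repo, repo.get_summarizer())
indexer.build()

# Query the summaries instead of the raw code
searcher = indexer.get_searcher()
hits = searcher.search("where is user authentication handled?", top_k=5)
```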
## Feeding Results to an LLM

Combine `VectorSearcher` with `ContextAssembler` to build an LLM prompt containing only relevant code:

```python
from kit import ContextAssembler

chunks = repo.search_semantic("jwt auth flow", embed_fn=embed_fn, top_k=10)

assembler = ContextAssembler(max_chars=12_000)
context = assembler.from_chunks(chunks)

llm_response = my_llm.chat(prompt + context)
```
Limitations & Tips
Section titled “Limitations & Tips”- Indexing a very large monorepo may take minutes: consider running on CI and committing
.kit/vector_db
. - Embeddings are language-agnostic – comments & docs influence similarity too. Clean code/comments improve search.
- Exact-keyword search (
repo.search_text()
) can still be faster for quick look-ups; combine both techniques.
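One way to combine them is a small helper that runs the keyword search first and falls back to semantic hits. A sketch only, assuming both calls return lists of dicts with a `"file"` key (as in the example output above); the `hybrid_search` name and de-duplication logic are illustrative, not part of kit:

```python
def hybrid_search(repo, query, embed_fn, top_k=10):
    """Merge exact-keyword hits with semantic hits, keyword matches first."""
    keyword_hits = repo.search_text(query)
    semantic_hits = repo.search_semantic(query, embed_fn=embed_fn, top_k=top_k)

    seen, merged = set(), []
    for hit in keyword_hits + semantic_hits:
        key = (hit.get("file"), hit.get("name"))
        if key not in seen:
            seen.add(key)
            merged.append(hit)
    return merged[:top_k]
```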