Semantic Searching

⚠ Experimental: Vector / semantic search is an early feature. APIs, CLI commands, and index formats may change in future releases without notice.

kit supports semantic code search using vector embeddings and ChromaDB.

  • Chunks your codebase (by symbols or lines)
  • Embeds each chunk using your chosen model (OpenAI, HuggingFace, etc.)
  • Stores embeddings in a local ChromaDB vector database
  • Lets you search for code using natural language or code-like queries

```python
from kit import Repository
from sentence_transformers import SentenceTransformer

# Use any embedding model you like
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_fn(text):
    return model.encode([text])[0].tolist()

repo = Repository("/path/to/codebase")
vs = repo.get_vector_searcher(embed_fn=embed_fn)
vs.build_index()  # Index all code chunks (run once, or after code changes)

results = repo.search_semantic("How is authentication handled?", embed_fn=embed_fn)
for hit in results:
    print(hit["file"], hit.get("name"), hit.get("type"), hit.get("code"))

# Example output:
# src/kit/auth.py    login        function  def login(...): ...
# src/kit/config.py  AUTH_CONFIG  variable  AUTH_CONFIG = {...}
```

kit is model-agnostic: pass any function with the signature `str -> List[float]`.

  • Local (open-source) – e.g. sentence-transformers models like all-MiniLM-L6-v2 (see example above). 100 MB-ish download, fast CPU inference.
  • Cloud API – e.g. OpenAI text-embedding-3-small. Define:

  ```python
  import openai

  def embed_fn(text: str):
      resp = openai.embeddings.create(model="text-embedding-3-small", input=[text])
      return resp.data[0].embedding
  ```
  • BatchingVectorSearcher will attempt to call your embed_fn with a list of texts for efficiency. If your function only supports single strings, it still works (kit falls back internally); a batch-aware embed_fn is sketched below.
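
A minimal batch-aware sketch, assuming the sentence-transformers model from the quick-start example (the single-string/list calling convention is inferred from the fallback behaviour described above):

```python
from typing import List, Union
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_fn(texts: Union[str, List[str]]):
    # BatchingVectorSearcher may pass a list of texts; other callers pass one str.
    if isinstance(texts, str):
        return model.encode([texts])[0].tolist()
    return model.encode(list(texts)).tolist()  # one vector per input text
```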

`vs.build_index(chunk_by="symbols")` (default) extracts functions/classes/variables via the existing AST parser. This is usually what you want; both modes are shown in the snippet after this list.

  • `chunk_by="lines"` splits code into ~50-line blocks – useful for languages where symbol extraction isn’t supported yet.
  • You can re-index at any time; the previous collection is cleared automatically.
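
For example:

```python
vs.build_index(chunk_by="symbols")  # default: one chunk per function/class/variable
vs.build_index(chunk_by="lines")    # fallback: ~50-line blocks
```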

The index lives under `.kit/vector_db` by default (one Chroma collection per path).

```python
vs = repo.get_vector_searcher(embed_fn, persist_dir=".kit/my_index")
vs.build_index()

# … later …
searcher = repo.get_vector_searcher(embed_fn, persist_dir=".kit/my_index")
results = searcher.search("add user authentication")
```

Prefer meaning-first search? Instead of embedding raw code you can build an index of LLM-generated summaries:

`DocstringIndexer` → `SummarySearcher`

See Docstring-Based Vector Index for details.
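
A rough sketch of how the pair fits together (treat the constructor arguments and the `get_summarizer()` wiring as assumptions; the linked page documents the actual API):

```python
from kit import Repository, DocstringIndexer, SummarySearcher

repo = Repository("/path/to/codebase")

# Assumed wiring: the indexer asks an LLM to summarize each symbol and embeds
# the summaries; the searcher then queries that summary index.
indexer = DocstringIndexer(repo, summarizer=repo.get_summarizer())
indexer.build()

searcher = SummarySearcher(indexer)
hits = searcher.search("where do we validate session tokens?", top_k=5)
```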

Combine `VectorSearcher` with `ContextAssembler` to build an LLM prompt containing only relevant code:

```python
from kit import ContextAssembler

chunks = repo.search_semantic("jwt auth flow", embed_fn=embed_fn, top_k=10)

assembler = ContextAssembler(max_chars=12_000)
context = assembler.from_chunks(chunks)

llm_response = my_llm.chat(prompt + context)  # my_llm and prompt are your own LLM client and base prompt
```
  • Indexing a very large monorepo may take minutes: consider running the build on CI and committing `.kit/vector_db`.
  • Embeddings are language-agnostic – comments and docs influence similarity too, so clean code and comments improve search quality.
  • Exact-keyword search (`repo.search_text()`) can still be faster for quick look-ups; combine both techniques, as sketched below.
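
One possible combination (a sketch: `search_text` matches literally, so the two query strings differ on purpose):

```python
# Hybrid lookup: try a cheap exact-keyword pass first, then fall back
# to semantic search when the literal query finds nothing.
hits = repo.search_text("jwt_decode")
if not hits:
    hits = repo.search_semantic("where are JWTs decoded?", embed_fn=embed_fn)
```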