Configuring Semantic Search
Semantic search allows you to find code based on meaning rather than just keywords. To enable this in kit, you need to configure a vector embedding model and potentially a vector database backend.
Required: Embedding Function
You must provide an embedding function (embed_fn) when first accessing semantic search features via repo.get_vector_searcher() or repo.search_semantic().
This function takes a list of text strings and returns a list of corresponding embedding vectors.
    from kit import Repo
    # Example using a hypothetical embedding function
    from my_embedding_library import get_embeddings

    repo = Repo("/path/to/repo")

    # Define the embedding function wrapper if necessary
    def embed_fn(texts: list[str]) -> list[list[float]]:
        # Adapt this to your specific embedding library/API
        return get_embeddings(texts)

    # Pass the function when searching
    results = repo.search_semantic("database connection logic", embed_fn=embed_fn)

    # Or when getting the searcher explicitly
    vector_searcher = repo.get_vector_searcher(embed_fn=embed_fn)
Popular choices include models from OpenAI, Cohere, or open-source models via libraries like Hugging Face’s sentence-transformers.
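To make the embed_fn contract concrete (a list of strings in, one vector per string out), here is a minimal sketch of a stand-in embedding function that needs no external model. The name hashing_embed_fn and the dim parameter are hypothetical and not part of kit. The vectors it produces carry no real semantic meaning, so it is only suitable for checking that the wiring works before a real model is plugged in.

```python
import hashlib
import math


def hashing_embed_fn(texts: list[str], dim: int = 64) -> list[list[float]]:
    """Toy embedding: hash each whitespace token into one of `dim` buckets.

    Hypothetical helper for wiring tests only; not a real semantic model.
    """
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            # Use a stable hash so results are identical across runs.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % dim
            vec[bucket] += 1.0
        # L2-normalize non-empty vectors so lengths are comparable.
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vec = [x / norm for x in vec]
        vectors.append(vec)
    return vectors


vectors = hashing_embed_fn(["open database connection", "close the pool"])
print(len(vectors), len(vectors[0]))
```

Because the function returns exactly one fixed-length vector per input string, it has the same shape as the embed_fn shown above and could be passed in its place while testing.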
Backend Configuration
kit’s VectorSearcher uses a pluggable backend system for storing and querying vector embeddings. Currently, the primary supported and default backend is ChromaDB.
ChromaDB (Default)
When you initialize VectorSearcher (typically via repo.get_vector_searcher()) without specifying a backend argument, kit automatically uses an instance of ChromaDBBackend.
Configuration Options:
- persist_dir (Optional[str]): This is the most important configuration option. It specifies the directory where the ChromaDB index will be stored on disk.
  - If you provide a path to repo.get_vector_searcher(persist_dir=...) or directly to the VectorSearcher constructor, that path will be used.
  - If no persist_dir is specified, kit defaults to creating the index in a subdirectory within your repository, typically at YOUR_REPO_PATH/.kit/vector_db/.
  - Persisting the index allows you to reuse it across sessions without needing to re-embed and re-index your codebase every time.
At present, other ChromaDB-specific configurations (like collection names or distance metrics) are managed internally by kit with default settings. Future versions may expose more fine-grained control.
    # Example: Initialize with default ChromaDB backend and specify a persist directory
    vector_searcher = repo.get_vector_searcher(
        embed_fn=my_embedding_function,
        persist_dir="./my_custom_kit_vector_index"  # Index will be saved here
    )

    # Building the index (first time or to update)
    vector_searcher.build_index()

    # Later, to reuse the persisted index:
    # Ensure you use the same embed_fn and persist_dir
    vector_searcher_reloaded = repo.get_vector_searcher(
        embed_fn=my_embedding_function,
        persist_dir="./my_custom_kit_vector_index"
    )
    results = vector_searcher_reloaded.search("my query")
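The build-once, reuse-later pattern above can be wrapped in a small helper. The following is a sketch only: get_or_build_searcher is a hypothetical name, and it assumes that a non-empty persist directory means an index was already built, which kit itself does not guarantee. It relies only on the get_vector_searcher() and build_index() calls shown above.

```python
import os
from typing import Any, Callable


def get_or_build_searcher(
    repo: Any,
    embed_fn: Callable[[list[str]], list[list[float]]],
    persist_dir: str,
) -> Any:
    """Return a searcher, building the index only if persist_dir looks empty.

    Hypothetical helper; `repo` is expected to behave like a kit Repo.
    """
    # Check before creating the searcher, because the backend may create
    # the directory as a side effect of initialization.
    had_index = os.path.isdir(persist_dir) and bool(os.listdir(persist_dir))
    searcher = repo.get_vector_searcher(embed_fn=embed_fn, persist_dir=persist_dir)
    if not had_index:
        # Assumption: an empty or missing directory means nothing was indexed yet.
        searcher.build_index()
    return searcher
```

If the codebase has changed since the index was written, this helper will not notice; in that case call build_index() explicitly to update it.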
Other Backends
While the VectorDBBackend interface is designed to support other vector databases, ChromaDB is the primary focus for now. If you have a need for other backends like Faiss (especially for purely in-memory, non-persisted use cases) or others, please raise an issue on the kit GitHub repository.
Choosing an Embedding Model
Popular choices include models from OpenAI, Cohere, or open-source models via libraries like Hugging Face’s sentence-transformers.