DocstringIndexer API
The DocstringIndexer class is responsible for building a vector index of AI-generated code summaries (docstrings). It processes files in a repository, generates summaries for code symbols (or entire files), embeds these summaries, and stores them in a configurable vector database backend. Once an index is built, it can be queried using the SummarySearcher class.
Constructor
Class: DocstringIndexer (defined in kit/docstring_indexer.py)
from kit import Repository, Summarizer
from kit.docstring_indexer import DocstringIndexer, EmbedFn  # EmbedFn is Optional[Callable[[str], List[float]]]
from kit.vector_searcher import VectorDBBackend  # Optional

# Example basic initialization
repo = Repository("/path/to/your/repo")
summarizer = Summarizer()  # Assumes OPENAI_API_KEY is set or local model configured
indexer = DocstringIndexer(repo=repo, summarizer=summarizer)

# Example with custom embedding function and backend
# def my_custom_embed_fn(text: str) -> List[float]:
#     # ... your embedding logic ...
#     return [0.1, 0.2, ...]
#
# from kit.vector_searcher import ChromaDBBackend
# custom_backend = ChromaDBBackend(collection_name="my_custom_index", persist_dir="./my_chroma_db")
#
# indexer_custom = DocstringIndexer(
#     repo=repo,
#     summarizer=summarizer,
#     embed_fn=my_custom_embed_fn,
#     backend=custom_backend,
#     persist_dir="./my_custom_index_explicit_persist"  # Can also be set directly on backend
# )
Parameters:
repo (Repository, required): An instance of kit.Repository pointing to the codebase to be indexed.
summarizer (Summarizer, required): An instance of kit.Summarizer used to generate summaries for code symbols or files.
embed_fn (Optional[Callable[[str], List[float]]], default: SentenceTransformer('all-MiniLM-L6-v2')): A function that takes a string and returns its embedding (a list of floats). If None, a default embedding function using sentence-transformers (all-MiniLM-L6-v2 model) will be used. The sentence-transformers package must be installed for the default to work (pip install sentence-transformers).
backend (Optional[VectorDBBackend], default: ChromaDBBackend): The vector database backend to use for storing and querying embeddings. If None, a ChromaDBBackend instance will be created. The default collection name is kit_docstring_index.
persist_dir (Optional[str], default: './.kit_index/' + repo_name_slug + '/docstrings'): The directory where the vector database (e.g., ChromaDB) should persist its data. If None, a default path is constructed based on the repository name within a .kit_index directory in the current working directory. If a custom backend is provided, this parameter might be ignored if the backend itself has persistence configured. It is primarily used for the default ChromaDBBackend if no explicit backend is given or if the default backend needs a specific persistence path.
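The embed_fn parameter accepts any callable that maps a string to a list of floats. As a rough illustration of that contract, the sketch below defines a toy, dependency-free embedding built from hashed character trigrams. The name hashed_trigram_embed and the 64-dimension size are hypothetical choices made for this example; the function is not part of kit and is far weaker than a real sentence-embedding model.

```python
import hashlib
import math
from typing import List

EMBED_DIM = 64  # hypothetical dimension chosen for this sketch


def hashed_trigram_embed(text: str) -> List[float]:
    """Toy embedding: bucket character trigrams by hash, then L2-normalize."""
    vec = [0.0] * EMBED_DIM
    lowered = text.lower()
    # Slide a 3-character window over the text (at least one window, even for short input).
    for i in range(max(len(lowered) - 2, 1)):
        trigram = lowered[i:i + 3]
        digest = hashlib.md5(trigram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % EMBED_DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


vector = hashed_trigram_embed("Parses a config file and returns a dict")
print(len(vector))
```

A function of this shape could be passed as embed_fn=hashed_trigram_embed. Whatever function is chosen, stored and query vectors are only comparable when both come from the same embedding function.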
Methods
Method: DocstringIndexer.build (defined in kit/docstring_indexer.py)
Builds or rebuilds the docstring index. It iterates through files in the repository (respecting .gitignore and file_extensions), extracts symbols or uses whole file content based on the level, generates summaries, embeds them, and adds them to the vector database. It also handles caching to avoid re-processing unchanged symbols/files.
# Build the index (symbol-level by default for .py files)
indexer.build()

# Force a rebuild, ignoring any existing cache
indexer.build(force=True)

# Index at file level instead of symbol level
indexer.build(level="file")

# Index only specific file extensions
indexer.build(file_extensions=[".py", ".mdx"])
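The caching behaviour mentioned in the description (skipping unchanged symbols or files unless force=True) can be pictured as a content-hash check. The following is a minimal sketch of that idea only; needs_processing and the dict-based cache are hypothetical names and do not reflect how kit actually stores its cache.

```python
import hashlib
from typing import Dict


def content_hash(source: str) -> str:
    """Stable fingerprint of a symbol's or file's source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def needs_processing(key: str, source: str, cache: Dict[str, str], force: bool = False) -> bool:
    """Return True when the item must be summarized and embedded again."""
    new_hash = content_hash(source)
    if not force and cache.get(key) == new_hash:
        return False  # unchanged since the last build: reuse the stored entry
    cache[key] = new_hash  # record the new fingerprint for the next build
    return True


cache: Dict[str, str] = {}
first = needs_processing("utils.py::parse", "def parse(): ...", cache)
second = needs_processing("utils.py::parse", "def parse(): ...", cache)
edited = needs_processing("utils.py::parse", "def parse(x): ...", cache)
forced = needs_processing("utils.py::parse", "def parse(x): ...", cache, force=True)
print(first, second, edited, forced)  # True False True True
```

Keying the cache on a hash of the source text means that renaming or moving unrelated code does not invalidate entries, while any edit to the item itself does.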
Parameters:
force (bool, default: False): If True, the entire index is rebuilt, ignoring any existing cache and potentially overwriting existing data in the backend. If False, uses cached summaries/embeddings for unchanged code and only processes new/modified code. It also avoids re-initializing the backend if it already contains data, unless changes are detected.
level (str, default: 'symbol'): The granularity of indexing.
  'symbol': Extracts and summarizes individual symbols (functions, classes, methods) from files.
  'file': Summarizes the entire content of each file.
file_extensions (Optional[List[str]], default: None (uses Repository’s default, typically .py)): A list of file extensions (e.g., ['.py', '.md']) to include in the indexing process. If None, uses the default behavior of the Repository instance, which typically focuses on Python files but can be configured.
Returns: None
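To make the two level values concrete, the sketch below shows which units would be summarized for a small Python file under each setting. It uses the standard-library ast module purely for illustration; units_to_summarize is a hypothetical helper, and kit's own symbol extraction may work differently and cover other languages.

```python
import ast
from typing import List, Optional, Tuple

SAMPLE = '''
class Greeter:
    def greet(self, name):
        return f"Hello, {name}"

def main():
    print(Greeter().greet("world"))
'''


def units_to_summarize(path: str, source: str, level: str = "symbol") -> List[Tuple[str, Optional[str]]]:
    """Return (identifier, source_text) pairs that one build pass would summarize."""
    if level == "file":
        return [(path, source)]  # one summary for the whole file
    if level != "symbol":
        raise ValueError(f"unknown level: {level!r}")
    units = []
    # Collect every function, async function, and class, including nested methods.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            units.append((f"{path}::{node.name}", ast.get_source_segment(source, node)))
    return units


print(sorted(name for name, _ in units_to_summarize("app.py", SAMPLE, level="symbol")))
print([name for name, _ in units_to_summarize("app.py", SAMPLE, level="file")])
```

Symbol-level indexing produces more, smaller entries (here one per class, method, and function), while file-level indexing produces a single entry per file.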
get_searcher
Method: DocstringIndexer.get_searcher (defined in kit/docstring_indexer.py)
Returns a SummarySearcher instance that is configured to query the index managed by this DocstringIndexer.
This provides a convenient way to obtain a search interface after the indexer has been built or loaded, without needing to manually instantiate SummarySearcher.
# Assuming 'indexer' is an initialized DocstringIndexer instance
# indexer.build()  # or it has been loaded with a pre-built index

search_interface = indexer.get_searcher()
results = search_interface.search("my search query", top_k=3)

for result in results:
    print(result)
Parameters: None
Returns: SummarySearcher
An instance of SummarySearcher linked to this indexer.
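Conceptually, a top_k query embeds the query text and returns the stored summaries whose vectors lie closest to it. The sketch below illustrates that ranking step over an in-memory list. The function top_k_by_cosine and the sample vectors are hypothetical, and cosine similarity is an assumption made for this example; the real search is delegated to the configured vector database backend, which may use a different distance metric.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_by_cosine(
    query_vec: Sequence[float],
    entries: List[Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every stored entry against the query and keep the best top_k."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in entries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical stored summaries, each paired with a made-up 3-dimensional embedding.
entries = [
    ("auth.py::login", [0.9, 0.1, 0.0]),
    ("db.py::connect", [0.0, 1.0, 0.0]),
    ("auth.py::logout", [0.8, 0.2, 0.1]),
]
hits = top_k_by_cosine([1.0, 0.0, 0.0], entries, top_k=2)
print([label for _, label in hits])  # ['auth.py::login', 'auth.py::logout']
```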