# Docstring-based Vector Indexing

`DocstringIndexer` builds a vector index using LLM-generated summaries of source files (“docstrings”) instead of the raw code. This often yields more relevant results because the embedded text focuses on intent rather than syntax or specific variable names.
## Why use it?

- **Cleaner embeddings** – Comments like “Summary of retry logic” embed better than nested `for`-loops.
- **Smaller index** – One summary per file (or symbol) is < 1 kB, while the file itself might be thousands of tokens.
- **Provider-agnostic** – Works with any LLM supported by `kit.Summarizer` (OpenAI, Anthropic, Google…).
## How it Works

- **Configuration**: Instantiate `DocstringIndexer` with a `Repository` object and a `Summarizer` (configured with your desired LLM, e.g., OpenAI, Anthropic, Google). An embedding function (`embed_fn`) can also be provided if you wish to use a custom embedding model; otherwise, `DocstringIndexer` will use a default embedding function (based on `sentence-transformers`, which is included in the standard installation).
```python
from kit import Repository, DocstringIndexer, Summarizer
from kit.llms.openai import OpenAIConfig  # For configuring the summarization LLM

# 1. Initialize your Repository
repo = Repository("/path/to/your/codebase")

# 2. Configure and initialize the Summarizer
# It's good practice to specify the model you want for summarization.
# Summarizer defaults to OpenAIConfig() if no config is passed, which then
# might use environment variables (OPENAI_MODEL) or a default model from OpenAIConfig.
llm_summarizer_config = OpenAIConfig(model="gpt-4o")  # Or "gpt-4-turbo", etc.
summarizer = Summarizer(repo, config=llm_summarizer_config)

# 3. Initialize DocstringIndexer
# By default, DocstringIndexer now uses SentenceTransformer('all-MiniLM-L6-v2')
# for embeddings, so you don't need to provide an embed_fn for basic usage.
indexer = DocstringIndexer(repo, summarizer)

# 4. Build the index
# This will process the repository, generate summaries, and create embeddings.
indexer.build()

# After building, you can query the index using a SummarySearcher.

# Option 1: Manually create a SummarySearcher (traditional way)
# from kit import SummarySearcher
# searcher_manual = SummarySearcher(indexer)

# Option 2: Use the convenient get_searcher() method (recommended)
searcher = indexer.get_searcher()

# Now you can use the searcher
results = searcher.search("your query here", top_k=3)
for result in results:
    print(f"Found: {result.get('metadata', {}).get('file_path')}::{result.get('metadata', {}).get('symbol_name')}")
    print(f"Summary: {result.get('metadata', {}).get('summary')}")
    print(f"Score: {result.get('score')}")
    print("---")
```
## Using a Custom Embedding Function (Optional)

If you want to use a different embedding model or a custom embedding function, you can pass it to the `DocstringIndexer` during initialization. The function should take a string as input and return a list of floats (the embedding vector).

For example, if you wanted to use a different model from the `sentence-transformers` library:
```python
from kit import Repository, DocstringIndexer, Summarizer
from kit.llms.openai import OpenAIConfig
from sentence_transformers import SentenceTransformer  # Make sure you have this installed

repo = Repository("/path/to/your/codebase")
llm_summarizer_config = OpenAIConfig(model="gpt-4o")
summarizer = Summarizer(repo, config=llm_summarizer_config)

# Load a specific sentence-transformer model
# You can find available models at https://www.sbert.net/docs/pretrained_models.html
custom_st_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

def custom_embedding_function(text_to_embed: str):
    embedding_vector = custom_st_model.encode(text_to_embed)
    return embedding_vector.tolist()

# Initialize DocstringIndexer with your custom embedding function
indexer_custom = DocstringIndexer(repo, summarizer, embed_fn=custom_embedding_function)

indexer_custom.build()
```
This approach gives you flexibility if the default embedding model doesn’t meet your specific needs.
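The same pattern works for hosted embedding APIs. As a minimal sketch (assuming the `openai` package is installed and `OPENAI_API_KEY` is set), you could wrap OpenAI's embeddings endpoint instead of a local model:

```python
from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def openai_embedding_function(text_to_embed: str) -> list[float]:
    # "text-embedding-3-small" is one of OpenAI's embedding models; any model
    # that returns a vector of floats works here.
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text_to_embed,
    )
    return response.data[0].embedding

indexer_openai = DocstringIndexer(repo, summarizer, embed_fn=openai_embedding_function)
indexer_openai.build()
```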
## Inspecting the Indexed Data (Optional)

After building the index, you might want to inspect its raw contents to understand what was stored. This can be useful for debugging or exploration. The exact method depends on the `VectorDBBackend` being used.

If you're using the default `ChromaDBBackend` (or have explicitly configured it), you can access the underlying ChromaDB collection and retrieve entries.
```python
# Assuming 'indexer' is your DocstringIndexer instance after 'indexer.build()' has run.
# And 'indexer.backend' is an instance of ChromaDBBackend.

if hasattr(indexer.backend, 'collection'):
    chroma_collection = indexer.backend.collection
    print(f"Inspecting ChromaDB collection: {chroma_collection.name}")
    print(f"Number of items: {chroma_collection.count()}")

    # Retrieve the first few items (e.g., 3)
    # We include 'metadatas' and 'documents' (which holds the summary text).
    # 'embeddings' are excluded for brevity.
    retrieved_data = chroma_collection.get(
        limit=3,
        include=['metadatas', 'documents']
    )

    if retrieved_data and retrieved_data.get('ids'):
        for i in range(len(retrieved_data['ids'])):
            item_id = retrieved_data['ids'][i]
            # The 'document' is the summary text that was embedded.
            summary_text = retrieved_data['documents'][i] if retrieved_data['documents'] else "N/A"
            # 'metadata' contains file_path, symbol_name, original summary, etc.
            metadata = retrieved_data['metadatas'][i] if retrieved_data['metadatas'] else {}

            print(f"\n--- Item {i+1} ---")
            print(f"  ID (in Chroma): {item_id}")
            print(f"  Stored Summary (Document): {summary_text}")
            print(f"  Metadata:")
            for key, value in metadata.items():
                print(f"    {key}: {value}")
    else:
        print("No items found in the collection or collection is empty.")
else:
    print("The configured backend does not seem to be ChromaDB or doesn't expose a 'collection' attribute for direct inspection this way.")
```
**Expected Output from Inspection:**

Running the inspection code above might produce output like this:

```
Inspecting ChromaDB collection: kit_docstring_index
Number of items: 10  # Or however many items are in your test repo index

--- Item 1 ---
  ID (in Chroma): utils.py::greet
  Stored Summary (Document): The `greet` function in the `utils.py` file is designed to generate a friendly greeting message...
  Metadata:
    file_path: utils.py
    level: symbol
    summary: The `greet` function in the `utils.py` file is designed to generate a friendly greeting message...
    symbol_name: greet
    symbol_type: function

--- Item 2 ---
  ID (in Chroma): app.py::main
  Stored Summary (Document): The `main` function in `app.py` demonstrates a simple authentication workflow...
  Metadata:
    file_path: app.py
    level: symbol
    summary: The `main` function in `app.py` demonstrates a simple authentication workflow...
    symbol_name: main
    symbol_type: function

... (and so on)
```
This shows that each entry in the ChromaDB collection has:

- An `id` (often `file_path::symbol_name`).
- The `document` field, which is the text of the summary that was embedded.
- `metadata` containing details like `file_path`, `symbol_name`, `symbol_type`, `level`, and often a redundant copy of the `summary` itself.
Knowing the structure of this stored data can be very helpful when working with search results or debugging the indexing process.
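For example, once you know where the fields live, you can post-process search hits, say, grouping symbol-level matches by file. A small sketch (note that depending on your kit version, fields may sit at the top level of each hit or under a `metadata` key; both shapes appear in the examples on this page):

```python
from collections import defaultdict

hits_by_file = defaultdict(list)
for hit in searcher.search("database connection handling", top_k=10):
    # Fall back to the hit itself if fields are not nested under 'metadata'.
    meta = hit.get('metadata') or hit
    hits_by_file[meta.get('file_path')].append(meta.get('symbol_name'))

for file_path, symbols in hits_by_file.items():
    print(f"{file_path}: {symbols}")
```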
## Symbol-Level Indexing

For more granular search, you can instruct `DocstringIndexer` to create summaries for individual functions and classes within your files. This allows for highly specific semantic queries like “find the class that manages database connections” or “what function handles user authentication?”

To enable symbol-level indexing, pass `level="symbol"` to `build()`:
```python
# Build a symbol-level index
indexer.build(level="symbol", file_extensions=[".py"], force=True)
```
When `level="symbol"`:

- `DocstringIndexer` iterates through files, then extracts symbols (functions, classes) from each file using `repo.extract_symbols()`.
- It then calls `summarizer.summarize_function()` or `summarizer.summarize_class()` for each symbol.
- The resulting embeddings are stored with metadata including:
  - `file_path`: The path to the file containing the symbol.
  - `symbol_name`: The name of the function or class (e.g., `my_function`, `MyClass`, `MyClass.my_method`).
  - `symbol_type`: The type of symbol (e.g., “FUNCTION”, “CLASS”, “METHOD”).
  - `summary`: The LLM-generated summary of the symbol.
  - `level`: Set to `"symbol"`.
**Querying**: Use `SummarySearcher` to find relevant summaries.

```python
from kit import SummarySearcher

searcher = SummarySearcher(indexer)  # Pass the built indexer
results = searcher.search("user authentication logic", top_k=3)
for res in results:
    print(f"Score: {res['score']:.4f}")
    if res.get('level') == 'symbol':
        print(f"  Symbol: {res['symbol_name']} ({res['symbol_type']}) in {res['file_path']}")
    else:
        print(f"  File: {res['file_path']}")
    print(f"  Summary: {res['summary'][:100]}...")
    print("---")
```

The `results` will contain the summary and associated metadata, including the `level` and symbol details if applicable.
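For intuition, the sketch below shows roughly what a symbol-level build does under the hood. It is a simplified illustration, not kit's actual implementation: the `summarize_function`/`summarize_class` call signatures and the symbol dict keys (`name`, `type`) are assumptions here; consult the API reference for the real ones.

```python
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("all-MiniLM-L6-v2")  # kit's default embedding model

for file_path in ["utils.py", "app.py"]:  # in reality: every matching file in the repo
    for symbol in repo.extract_symbols(file_path):
        # Assumed keys and signatures, for illustration only.
        if symbol["type"].upper() == "CLASS":
            summary = summarizer.summarize_class(file_path, symbol["name"])
        else:
            summary = summarizer.summarize_function(file_path, symbol["name"])
        embedding = st_model.encode(summary).tolist()
        # The indexer persists `embedding` with metadata such as
        # {"file_path": file_path, "symbol_name": symbol["name"], "level": "symbol"}.
```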
## Quick start

```python
import kit
from sentence_transformers import SentenceTransformer

repo = kit.Repository("/path/to/your/project")

# 1. LLM summarizer (make sure OPENAI_API_KEY / etc. is set)
summarizer = repo.get_summarizer()

# 2. Embedding function (any model that returns list[float])
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
embed_fn = lambda txt: embed_model.encode(txt).tolist()

# 3. Build the index (stored in .kit/docstring_db)
indexer = kit.DocstringIndexer(repo, summarizer, embed_fn=embed_fn)
indexer.build()

# 4. Search
searcher = kit.SummarySearcher(indexer)
for hit in searcher.search("retry back-off", top_k=5):
    print(hit["file"], "→", hit["summary"])
```
## Storage details

`DocstringIndexer` delegates persistence to any `kit.vector_searcher.VectorDBBackend`. The default backend is Chroma and lives in `.kit/docstring_db/` inside your repo.
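Since the default store is an ordinary Chroma database on disk, you can also open it directly with the `chromadb` client, outside of kit. A minimal sketch, assuming the default location and the `kit_docstring_index` collection name shown in the inspection output above:

```python
import chromadb

# Path and collection name are assumptions based on the defaults described above;
# adjust them if your kit version differs.
client = chromadb.PersistentClient(path="/path/to/your/project/.kit/docstring_db")
collection = client.get_collection("kit_docstring_index")
print(f"{collection.count()} entries")
print(collection.peek(limit=2))  # a couple of raw entries, embeddings included
```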
## Use Cases

- **Semantic Code Search**: Find code by describing what it does, not just what keywords it contains (e.g., “retry back-off logic” instead of trying to guess variable names like `exponential_delay` or `MAX_RETRIES`).
- **Onboarding**: Quickly understand what different parts of a codebase are for.
- **Automated Documentation**: Use the summaries as a starting point for API docs.
- **Codebase Q&A**: As shown in the Codebase Q&A Bot tutorial, combine `SummarySearcher` with an LLM to answer questions about your code, using summaries to find relevant context at either the file or symbol level (see the sketch below).
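The retrieval half of that Q&A pattern is just a search followed by a prompt. A minimal sketch, assuming the `openai` package, an `OPENAI_API_KEY`, and the nested-metadata result shape used in the search example earlier on this page:

```python
from openai import OpenAI

client = OpenAI()
question = "How does the retry back-off logic work?"

# 1. Retrieve the most relevant summaries as context.
hits = searcher.search(question, top_k=5)
context = "\n".join(
    f"- {h.get('metadata', {}).get('file_path')}: {h.get('metadata', {}).get('summary')}"
    for h in hits
)

# 2. Ask the LLM to answer using only that context.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer questions about this codebase using only the provided summaries."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```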
## API reference

See the docs for `DocstringIndexer` and `SummarySearcher` for full signatures.