
Docstring-based Vector Indexing


DocstringIndexer builds a vector index using LLM-generated summaries of source files (“docstrings”) instead of the raw code. This often yields more relevant results because the embedded text focuses on intent rather than syntax or specific variable names.

  • Cleaner embeddings – a prose summary such as “handles retry logic with exponential back-off” embeds better than the nested for-loops that implement it.
  • Smaller index – One summary per file (or symbol) is < 1 kB, while the file itself might be thousands of tokens.
  • Provider-agnostic – Works with any LLM supported by kit.Summarizer (OpenAI, Anthropic, Google…).
  1. Configuration: Instantiate DocstringIndexer with a Repository object and a Summarizer (configured with your desired LLM, e.g., OpenAI, Anthropic, Google). An embedding function (embed_fn) can also be provided if you wish to use a custom embedding model; otherwise, DocstringIndexer will use a default embedding function (based on sentence-transformers, which is included in the standard installation).

 

from kit import Repository, DocstringIndexer, Summarizer
from kit.llms.openai import OpenAIConfig # For configuring the summarization LLM
# 1. Initialize your Repository
repo = Repository("/path/to/your/codebase")
# 2. Configure and initialize the Summarizer
# It's good practice to specify the model you want for summarization.
# Summarizer defaults to OpenAIConfig() if no config is passed, which then
# might use environment variables (OPENAI_MODEL) or a default model from OpenAIConfig.
llm_summarizer_config = OpenAIConfig(model="gpt-4o") # Or "gpt-4-turbo", etc.
summarizer = Summarizer(repo, config=llm_summarizer_config)
# 3. Initialize DocstringIndexer
# By default, DocstringIndexer now uses SentenceTransformer('all-MiniLM-L6-v2')
# for embeddings, so you don't need to provide an embed_fn for basic usage.
indexer = DocstringIndexer(repo, summarizer)
# 4. Build the index
# This will process the repository, generate summaries, and create embeddings.
indexer.build()
# After building, you can query the index using a SummarySearcher.
# Option 1: Manually create a SummarySearcher (traditional way)
# from kit import SummarySearcher
# searcher_manual = SummarySearcher(indexer)
# Option 2: Use the convenient get_searcher() method (recommended)
searcher = indexer.get_searcher()
# Now you can use the searcher
results = searcher.search("your query here", top_k=3)
for result in results:
    print(f"Found: {result.get('metadata', {}).get('file_path')}::{result.get('metadata', {}).get('symbol_name')}")
    print(f"Summary: {result.get('metadata', {}).get('summary')}")
    print(f"Score: {result.get('score')}")
    print("---")

Using a Custom Embedding Function (Optional)


If you want to use a different embedding model or a custom embedding function, you can pass it to the DocstringIndexer during initialization. The function should take a string as input and return a list of floats (the embedding vector).

For example, if you wanted to use a different model from the sentence-transformers library:

from kit import Repository, DocstringIndexer, Summarizer
from kit.llms.openai import OpenAIConfig
from sentence_transformers import SentenceTransformer # Make sure you have this installed
repo = Repository("/path/to/your/codebase")
llm_summarizer_config = OpenAIConfig(model="gpt-4o")
summarizer = Summarizer(repo, config=llm_summarizer_config)
# Load a specific sentence-transformer model
# You can find available models at https://www.sbert.net/docs/pretrained_models.html
custom_st_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
def custom_embedding_function(text_to_embed: str):
    embedding_vector = custom_st_model.encode(text_to_embed)
    return embedding_vector.tolist()
# Initialize DocstringIndexer with your custom embedding function
indexer_custom = DocstringIndexer(repo, summarizer, embed_fn=custom_embedding_function)
indexer_custom.build()

This approach gives you flexibility if the default embedding model doesn’t meet your specific needs.
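
You could also swap in a hosted embeddings API instead of a local model. The following is a minimal sketch, not an officially documented recipe: it assumes the openai Python package (1.x) with OPENAI_API_KEY set in the environment, and the model name text-embedding-3-small is just an example choice.

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def openai_embedding_function(text_to_embed: str) -> list[float]:
    # Embed a single string via the OpenAI embeddings endpoint.
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text_to_embed,
    )
    return response.data[0].embedding

indexer_openai = DocstringIndexer(repo, summarizer, embed_fn=openai_embedding_function)
indexer_openai.build()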

After building the index, you might want to inspect its raw contents to understand what was stored. This can be useful for debugging or exploration. The exact method depends on the VectorDBBackend being used.

If you’re using the default ChromaDBBackend (or have explicitly configured it), you can access the underlying ChromaDB collection and retrieve entries.

# Assuming 'indexer' is your DocstringIndexer instance after 'indexer.build()' has run,
# and 'indexer.backend' is an instance of ChromaDBBackend.
if hasattr(indexer.backend, 'collection'):
    chroma_collection = indexer.backend.collection
    print(f"Inspecting ChromaDB collection: {chroma_collection.name}")
    print(f"Number of items: {chroma_collection.count()}")
    # Retrieve the first few items (e.g., 3).
    # We include 'metadatas' and 'documents' (which holds the summary text);
    # 'embeddings' are excluded for brevity.
    retrieved_data = chroma_collection.get(
        limit=3,
        include=['metadatas', 'documents']
    )
    if retrieved_data and retrieved_data.get('ids'):
        for i in range(len(retrieved_data['ids'])):
            item_id = retrieved_data['ids'][i]
            # The 'document' is the summary text that was embedded.
            summary_text = retrieved_data['documents'][i] if retrieved_data['documents'] else "N/A"
            # 'metadata' contains file_path, symbol_name, original summary, etc.
            metadata = retrieved_data['metadatas'][i] if retrieved_data['metadatas'] else {}
            print(f"\n--- Item {i+1} ---")
            print(f"  ID (in Chroma): {item_id}")
            print(f"  Stored Summary (Document): {summary_text}")
            print("  Metadata:")
            for key, value in metadata.items():
                print(f"    {key}: {value}")
    else:
        print("No items found in the collection or collection is empty.")
else:
    print("The configured backend does not seem to be ChromaDB or doesn't expose a 'collection' attribute for direct inspection this way.")

Expected Output from Inspection:

Running the inspection code above might produce output like this:

Inspecting ChromaDB collection: kit_docstring_index
Number of items: 10  # Or however many items are in your test repo index

--- Item 1 ---
  ID (in Chroma): utils.py::greet
  Stored Summary (Document): The `greet` function in the `utils.py` file is designed to generate a friendly greeting message...
  Metadata:
    file_path: utils.py
    level: symbol
    summary: The `greet` function in the `utils.py` file is designed to generate a friendly greeting message...
    symbol_name: greet
    symbol_type: function

--- Item 2 ---
  ID (in Chroma): app.py::main
  Stored Summary (Document): The `main` function in `app.py` demonstrates a simple authentication workflow...
  Metadata:
    file_path: app.py
    level: symbol
    summary: The `main` function in `app.py` demonstrates a simple authentication workflow...
    symbol_name: main
    symbol_type: function
... (and so on)

This shows that each entry in the ChromaDB collection has:

  • An id (often file_path::symbol_name).
  • The document field, which is the text of the summary that was embedded.
  • metadata containing details like file_path, symbol_name, symbol_type, level, and often a redundant copy of the summary itself.

Knowing the structure of this stored data can be very helpful when working with search results or debugging the indexing process.
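
As a quick sanity check, you can turn that stored structure into a per-file overview. The helper below is a hypothetical sketch (not part of kit) and assumes the ChromaDB backend shown above:

from collections import defaultdict

def list_indexed_symbols(chroma_collection):
    # Fetch every entry's metadata and group symbol names by file.
    data = chroma_collection.get(include=['metadatas'])
    by_file = defaultdict(list)
    for meta in data['metadatas'] or []:
        by_file[meta.get('file_path', '?')].append(meta.get('symbol_name', '<file>'))
    for file_path, symbols in sorted(by_file.items()):
        print(f"{file_path}: {', '.join(symbols)}")

list_indexed_symbols(indexer.backend.collection)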

For more granular search, you can instruct DocstringIndexer to create summaries for individual functions and classes within your files. This allows for highly specific semantic queries like “find the class that manages database connections” or “what function handles user authentication?”

To enable symbol-level indexing, pass level="symbol" to build():

# Build a symbol-level index
indexer.build(level="symbol", file_extensions=[".py"], force=True)

When level="symbol":

  • DocstringIndexer iterates through files, then extracts symbols (functions, classes) from each file using repo.extract_symbols().
  • It then calls summarizer.summarize_function() or summarizer.summarize_class() for each symbol.
  • The resulting embeddings are stored with metadata including:
    • file_path: The path to the file containing the symbol.
    • symbol_name: The name of the function or class (e.g., my_function, MyClass, MyClass.my_method).
    • symbol_type: The type of symbol (e.g., “FUNCTION”, “CLASS”, “METHOD”).
    • summary: The LLM-generated summary of the symbol.
    • level: Set to "symbol".
  2. Querying: Use SummarySearcher to find relevant summaries.

    from kit import SummarySearcher
    searcher = SummarySearcher(indexer) # Pass the built indexer
    results = searcher.search("user authentication logic", top_k=3)
    for res in results:
        print(f"Score: {res['score']:.4f}")
        if res.get('level') == 'symbol':
            print(f"  Symbol: {res['symbol_name']} ({res['symbol_type']}) in {res['file_path']}")
        else:
            print(f"  File: {res['file_path']}")
        print(f"  Summary: {res['summary'][:100]}...")
        print("---")

    The results will contain the summary and associated metadata, including the level and symbol details if applicable.
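
Putting it all together, here is a minimal end-to-end example: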

import kit
from sentence_transformers import SentenceTransformer
repo = kit.Repository("/path/to/your/project")
# 1. LLM summarizer (make sure OPENAI_API_KEY / etc. is set)
summarizer = repo.get_summarizer()
# 2. Embedding function (any model that returns list[float])
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
embed_fn = lambda txt: embed_model.encode(txt).tolist()
# 3. Build the index (stored in .kit/docstring_db)
indexer = kit.DocstringIndexer(repo, summarizer, embed_fn=embed_fn)
indexer.build()
# 4. Search
searcher = kit.SummarySearcher(indexer)
for hit in searcher.search("retry back-off", top_k=5):
    print(hit["file"], "→", hit["summary"])

DocstringIndexer delegates persistence to any kit.vector_searcher.VectorDBBackend. The default backend is Chroma and lives in .kit/docstring_db/ inside your repo.
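
If you want the index stored somewhere other than the default location, you can point the indexer elsewhere. This sketch assumes the constructor accepts a persist_dir argument (check the API reference for your version; the path below is arbitrary):

indexer = DocstringIndexer(
    repo,
    summarizer,
    persist_dir="/tmp/my_project_docstring_index",  # hypothetical custom location
)
indexer.build()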

  • Semantic Code Search: Find code by describing what it does, not just what keywords it contains. (e.g., “retry back-off logic” instead of trying to guess variable names like exponential_delay or MAX_RETRIES).
  • Onboarding: Quickly understand what different parts of a codebase are for.
  • Automated Documentation: Use the summaries as a starting point for API docs.
  • Codebase Q&A: As shown in the Codebase Q&A Bot tutorial, combine SummarySearcher with an LLM to answer questions about your code, using summaries to find relevant context at either the file or symbol level.
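
For the last use case, here is a hedged sketch of that retrieval-augmented pattern. It assumes the openai Python package (1.x) with OPENAI_API_KEY set; the hit-dictionary keys vary between the shapes shown earlier, so the code tries both, and the model name is just an example.

from openai import OpenAI

def answer_question(searcher, question: str, top_k: int = 5) -> str:
    # Retrieve the most relevant summaries to use as context.
    hits = searcher.search(question, top_k=top_k)
    context = "\n".join(
        f"- {hit.get('file_path', hit.get('file', '?'))}: {hit.get('summary', '')}"
        for hit in hits
    )
    client = OpenAI()  # assumes OPENAI_API_KEY is set
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer questions about the codebase using only the provided summaries."},
            {"role": "user", "content": f"Summaries:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer_question(searcher, "How does the retry back-off work?"))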

Check the API docs for DocstringIndexer and SummarySearcher for full signatures.