Codebase Q&A Bot with Summaries
This tutorial demonstrates how to build a simple question-answering bot for your codebase. The bot will:
- Use `DocstringIndexer` to create semantic summaries of each file.
- When a user asks a question, use `SummarySearcher` to find relevant file summaries.
- Fetch the full code of those top files.
- Use `ContextAssembler` to build a concise prompt for an LLM.
- Get an answer from the LLM.
This approach is powerful because it combines semantic understanding (from summaries) with the full detail of the source code, allowing an LLM to answer nuanced questions.
Prerequisites
- You have `kit` installed (`pip install cased-kit`).
- You have an OpenAI API key set (`export OPENAI_API_KEY=...`).
- You have a local Git repository you want to query.
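If you want to verify programmatically that the API key is visible to your Python environment before indexing, a quick check like the following works (a minimal sketch; adjust to however you manage credentials):

```python
import os

# Fail fast if the OpenAI API key is not set in the environment.
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running this tutorial."
```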
Step 1: Initialize Components
First, let’s set up our `Repository`, `DocstringIndexer`, `Summarizer` (for the indexer), `SummarySearcher`, and `ContextAssembler`.

```python
from kit import Repository, DocstringIndexer, Summarizer, SummarySearcher, ContextAssembler
from kit.summaries import OpenAIConfig

# --- Configuration ---
REPO_PATH = "/path/to/your/local/git/repo"  # MODIFY

# For DocstringIndexer, persist_dir is where the DB is stored.
# Let's use a directory for ChromaDB as it might create multiple files.
INDEX_PERSIST_DIR = "./my_code_qa_index_db/"

# Use a specific summarizer model for indexing; it can differ from the Q&A LLM.
INDEXER_LLM_CONFIG = OpenAIConfig(model="gpt-4o")

# LLM for answering the question based on the assembled context.
QA_LLM_CONFIG = OpenAIConfig(model="gpt-4o")  # Or your preferred model

# MAX_CONTEXT_CHARS is not directly used by ContextAssembler in this simplified flow.
TOP_K_SUMMARIES = 3  # Number of summaries retrieved by SummarySearcher
# --- END Configuration ---

repo = Repository(REPO_PATH)

# For DocstringIndexer - requires repo and a summarizer instance
summarizer_for_indexing = Summarizer(repo=repo, config=INDEXER_LLM_CONFIG)
indexer = DocstringIndexer(repo, summarizer_for_indexing, persist_dir=INDEX_PERSIST_DIR)

# For SummarySearcher - get it from the indexer
searcher = indexer.get_searcher()

# For assembling context for the Q&A LLM
assembler = ContextAssembler(repo)

# We'll need an LLM client to ask the final question
qa_llm_client = Summarizer(repo=repo, config=QA_LLM_CONFIG)._get_llm_client()

print("Components initialized.")
```

Make sure to replace `"/path/to/your/local/git/repo"` with the actual path to your repository. Also ensure the directory for `INDEX_PERSIST_DIR` (e.g., `my_code_qa_index_db/`) can be created.
Step 2: Build or Load the Index
The `DocstringIndexer` needs to process your repository to create summaries and embed them. This can take time for large repositories. We’ll check if an index already exists and build it if not.

```python
import os

# Check based on persist_dir for the indexer
if not os.path.exists(INDEX_PERSIST_DIR) or not any(os.scandir(INDEX_PERSIST_DIR)):
    print(f"Index not found or empty at {INDEX_PERSIST_DIR}. Building...")
    # Build a symbol-level index for more granular results.
    # force=True will rebuild if the directory exists but is perhaps from an old run.
    indexer.build(level="symbol", file_extensions=[".py", ".js", ".md"], force=True)
    print("Symbol-level index built successfully.")
else:
    print(f"Found existing index at {INDEX_PERSIST_DIR}.")
```
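Before wiring everything together, you can sanity-check the index by querying the searcher directly. This optional sketch assumes each search hit is a dictionary carrying at least `file_path` and `score` keys, as used in the next step; the query string is just an example:

```python
# Optional: inspect a few raw hits to confirm the index returns sensible results.
sample_hits = searcher.search("database connection handling", top_k=TOP_K_SUMMARIES)

for hit in sample_hits:
    print(hit.get("file_path"), round(hit.get("score", 0.0), 4))
```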
Step 3: Define the Question-Answering Function
This function will orchestrate the search, context assembly, and LLM query.
```python
def answer_question(user_query: str) -> str:
    print(f"\nSearching for files/symbols relevant to: '{user_query}'")

    # 1. Search for relevant file/symbol summaries
    search_results = searcher.search(user_query, top_k=TOP_K_SUMMARIES)

    if not search_results:
        return "I couldn't find any relevant files or symbols in the codebase to answer your question."

    print(f"Found {len(search_results)} relevant document summaries.")

    # Reset assembler for each new question to start with fresh context
    current_question_assembler = ContextAssembler(repo)

    # 2. Add relevant context to the assembler
    added_content_identifiers = set()  # To avoid adding the same file multiple times if symbols from it are retrieved

    for i, res in enumerate(search_results):
        file_path = res.get('file_path')
        identifier_for_log = file_path

        if res.get('level') == 'symbol':
            symbol_name = res.get('symbol_name', 'Unknown Symbol')
            symbol_type = res.get('symbol_type', 'Unknown Type')
            identifier_for_log = f"Symbol: {symbol_name} in {file_path} (Type: {symbol_type})"

        print(f"  {i+1}. {identifier_for_log} (Score: {res.get('score', 0.0):.4f})")

        # For simplicity, add the full file content for any relevant file found,
        # whether the hit was file-level or symbol-level.
        # A more advanced version could add specific symbol code using a custom method.
        if file_path and file_path not in added_content_identifiers:
            try:
                # Add full file content for context
                current_question_assembler.add_file(file_path)
                added_content_identifiers.add(file_path)
                print(f"    Added content of {file_path} to context.")
            except FileNotFoundError:
                print(f"    Warning: File {file_path} not found when trying to add to context.")
            except Exception as e:
                print(f"    Warning: Error adding {file_path} to context: {e}")

    if not added_content_identifiers:
        return "Found relevant file/symbol names, but could not retrieve their content for context."

    # 3. Get the assembled context string
    prompt_context = current_question_assembler.format_context()

    if not prompt_context.strip():
        return "Could not assemble any context for the LLM based on search results."

    # 4. Formulate the prompt and ask the LLM
    system_message = (
        "You are a helpful AI assistant with expertise in the provided codebase. "
        "Answer the user's question based *only* on the following code context. "
        "If the answer is not found in the context, say so. Be concise."
    )
    final_prompt = f"## Code Context:\n\n{prompt_context}\n\n## User Question:\n\n{user_query}\n\n## Answer:"

    print("\nSending request to LLM...")

    # Assuming OpenAI client for this example structure.
    # Adapt if using Anthropic or Google.
    if isinstance(QA_LLM_CONFIG, OpenAIConfig):
        response = qa_llm_client.chat.completions.create(
            model=QA_LLM_CONFIG.model,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": final_prompt}
            ]
        )
        answer = response.choices[0].message.content
    # Add elif for AnthropicConfig, GoogleConfig if desired, or abstract further
    else:
        # Simplified fallback or placeholder for other LLMs.
        # In a real app, you'd implement the specific API calls here.
        raise NotImplementedError(f"LLM client for {type(QA_LLM_CONFIG)} not fully implemented in this example.")

    return answer
```
Step 4: Ask a Question!
Now, let’s try it out.
my_question = "How does the authentication middleware handle expired JWTs?"# Or try: "What's the main purpose of the UserNotifications class's send_email method?"# Or: "Where is the database connection retry logic implemented in the db_utils module?"llm_answer = answer_question(my_question)print(f"\nLLM's Answer:\n{llm_answer}")Example Output (will vary based on your repo & LLM) Components initialized.Found existing index at ./my_code_qa_index_db/.Searching for files/symbols relevant to: 'How does the authentication middleware handle expired JWTs?'Found 3 relevant document summaries.1. Symbol: authenticate in src/auth/middleware.py (Type: function, Score: 0.8765)2. File: src/utils/jwt_helpers.py (Score: 0.7912)3. File: tests/auth/test_middleware.py (Score: 0.7500)Sending request to LLM...LLM's Answer:The `authenticate` function in `src/auth/middleware.py` checks for JWT expiration. If an `ExpiredSignatureError` is caught during token decoding (likely using a helper from `src/utils/jwt_helpers.py`), it returns a 401 Unauthorized response, typically with a JSON body like `{"error": "Token expired"}`.