This document outlines the current capabilities of the kit library and a potential roadmap for its future development. It’s a living document and will evolve as the project progresses.
kit aims to be a comprehensive Python toolkit for advanced code understanding, analysis, and interaction, with a strong emphasis on leveraging Large Language Models (LLMs) where appropriate. It’s designed to be modular, extensible, and developer-friendly.
As of now, kit provides the following core functionalities:
Repository Interaction
The Repository class acts as a central hub for accessing various code analysis features for a given codebase.
Code Mapping & Symbols
RepoMapper provides structural and symbol information from code files, using Tree-sitter for multi-language support and incremental updates.
Code Summarization
The Summarizer class, supporting multiple LLM providers (e.g., OpenAI, Anthropic, Google), generates summaries for code files, functions, and classes.
Docstring Indexing & Search
The DocstringIndexer generates and embeds AI-powered summaries (dynamic docstrings) for code elements. The SummarySearcher queries this index for semantic understanding and retrieval based on code intent.
Code Search
Includes CodeSearcher for literal/regex searches, and VectorSearcher for semantic search on raw code embeddings. For semantic search on AI-generated summaries, see “Docstring Indexing & Search”.
LLM Context Building
LLMContext helps in assembling relevant code snippets and information into effective prompts for LLMs.
Here are some areas we’re looking to improve and expand upon:
RepoMapper & Symbol Extraction:
- Deeper Language Insights: Beyond basic symbol extraction, explore richer semantic information (e.g., variable types, function signatures in more detail).
- Custom Symbol Types: Allow users to define and extract custom symbol types relevant to their specific frameworks or DSLs.
- Robustness: Continue to improve .gitignore handling and parsing of various project structures.
- Performance: Optimize scanning for very large repositories.
CodeSearcher:
- Full File Exclusion: Implement robust .gitignore and other ignore file pattern support.
- Advanced Search Options: Add features like whole-word matching, and consider more powerful query syntax.
- Performance: Explore integration with native search tools (e.g., ripgrep) as an optional backend for speed.
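As a rough illustration of the whole-word option mentioned above, the sketch below wraps a literal or regex query in word boundaries before searching. It is not kit's API: `search_lines`, `Match`, and every parameter name are hypothetical and exist only for this example.

```python
import re
from typing import Iterable, List, NamedTuple


class Match(NamedTuple):
    """One matching line (hypothetical type for this sketch)."""
    line_number: int
    line: str


def search_lines(
    lines: Iterable[str],
    query: str,
    *,
    regex: bool = False,
    whole_word: bool = False,
    ignore_case: bool = False,
) -> List[Match]:
    """Return the lines that match `query`, optionally as a whole word."""
    # Literal queries are escaped so characters like "." or "(" match themselves.
    pattern = query if regex else re.escape(query)
    if whole_word:
        # \b is a word boundary. This assumes the query starts and ends with
        # word characters (letters, digits, underscore).
        pattern = rf"\b(?:{pattern})\b"
    compiled = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    return [
        Match(number, line)
        for number, line in enumerate(lines, start=1)
        if compiled.search(line)
    ]


source = ["def parse(x):", "def parse_all(xs):", "reparse = 1"]
substring_hits = search_lines(source, "parse")
whole_word_hits = search_lines(source, "parse", whole_word=True)
```

With whole-word matching enabled, only the first line matches, because `parse_all` and `reparse` contain the query as part of a longer identifier.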
VectorSearcher (Semantic Search):
- Configurability: Offer more choices for embedding models, chunking strategies, and vector database backends for raw code embeddings.
- Hybrid Search: Explore combining keyword and semantic search for optimal results.
- Index Management: Tools for easier creation, updating, and inspection of semantic search indexes.
Docstring Indexing & Search Enhancements:
- Explore advanced indexing strategies (e.g., hierarchical summaries, metadata filtering for summary search).
- Improve management and scalability of summary vector stores.
- Investigate hybrid search techniques combining summary semantics with keyword precision.
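One common way to combine keyword and semantic results is Reciprocal Rank Fusion (RRF), which merges ranked lists without needing comparable scores. The roadmap does not say which technique kit would use, so the sketch below is only an illustration; the function name, the file names, and the constant `k = 60` (a conventional default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in,
    and the per-list scores are summed.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: one from a keyword search, one from a vector search.
keyword_hits = ["a.py", "b.py", "c.py"]
semantic_hits = ["b.py", "d.py", "a.py"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Files that appear in both lists (`b.py`, `a.py`) rise to the top, while files found by only one method are kept but ranked lower.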
Summarizer:
- Granular Summaries: Refine and expand function- and class-level summaries, ensuring broad coverage of language constructs and exploring different summary depths.
- Multi-LLM Support: Expand and standardize multi-LLM support, making it easier to integrate new cloud providers and local models, and improving the common configuration interfaces.
- Customizable Prompts: Allow users more control over the prompts used for summarization.
LLMContext:
- Smarter Context Retrieval: Develop more sophisticated strategies for selecting the most relevant context for different LLM tasks (e.g., using call graphs, semantic similarity, and historical data).
- Token Optimization: Implement techniques to maximize information density within LLM token limits.
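To make the token-budget idea concrete, here is a minimal sketch of greedy context packing: snippets are taken in order of relevance until the budget is used up. None of this is kit's implementation. `Snippet`, `estimate_tokens`, and `pack_context` are hypothetical names, and the four-characters-per-token estimate is a crude assumption that a real tokenizer would replace.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Snippet:
    """A candidate piece of context (hypothetical type for this sketch)."""
    path: str
    text: str
    score: float  # relevance to the task; higher is better


def estimate_tokens(text: str) -> int:
    # Assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def pack_context(snippets: Sequence[Snippet], budget: int) -> List[Snippet]:
    """Greedily keep the highest-scoring snippets that fit in `budget` tokens."""
    chosen: List[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.score, reverse=True):
        cost = estimate_tokens(snippet.text)
        if used + cost <= budget:
            chosen.append(snippet)
            used += cost
        # Snippets that do not fit are skipped; smaller ones may still fit later.
    return chosen


candidates = [
    Snippet("a.py", "x" * 40, score=0.9),  # about 10 tokens
    Snippet("b.py", "y" * 80, score=0.8),  # about 20 tokens
    Snippet("c.py", "z" * 20, score=0.5),  # about 5 tokens
]
packed = pack_context(candidates, budget=16)
```

Greedy selection is simple but not optimal. Ranking by score per token, or summarizing snippets that are too large, are possible refinements.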
- Refactoring Tools: Leverage kit’s understanding of code to suggest or perform automated refactoring.
- Code Generation: Explore LLM-powered code generation based on existing codebase patterns or natural language descriptions.
- Documentation Generation: Automate the creation or updating of code documentation using kit’s analysis and LLM capabilities.
- Tree-sitter Queries: Continuously expand and refine Tree-sitter queries for robust support across more programming languages and to address specific parsing challenges (e.g., HCL resource extraction noted previously).
- Framework Awareness: Develop extensions or plugins that provide specialized understanding for popular frameworks (e.g., Django, React, Spring).
- Comprehensive Testing: Ensure high test coverage for all modules and functionalities.
- Documentation: Maintain high-quality, up-to-date documentation, including API references, tutorials, and practical recipes.
- CLI Development: Develop a more feature-rich and user-friendly command-line interface for common kit operations.
- IDE Integration: Explore possibilities for integrating kit’s features into popular IDEs via plugins, MCP, or Language Server Protocol (LSP) extensions.
- REST API Service: Develop a comprehensive REST API service to make kit’s capabilities accessible to non-Python users and applications. This would allow developers using any programming language to leverage kit’s code intelligence features through standard HTTP requests.
- REST API & Service Layer: Expand the REST API service to provide comprehensive access to all kit features:
- Containerized Deployment: Provide Docker images and deployment templates for easy self-hosting.
- Client Libraries: Develop official client libraries for popular languages (TypeScript, Go, Rust) to interact with the kit API.
- Authentication & Multi-User Support: Implement secure authentication and multi-user capabilities for shared deployments.
- Webhooks & Events: Support webhook integrations for code events and analysis results.
- Plugin Architecture: Design kit with a clear plugin architecture to allow the community to easily add new languages, analysis tools, or LLM integrations.
This roadmap is ambitious, and priorities will be adjusted based on user feedback and development progress.