Repository API

The Repository class is the main entry point for analyzing code repositories with Kit.

Core Methods

get_file_tree(subpath=None)

Returns the file tree structure of the repository.

Parameters:

  • subpath (str, optional): Subdirectory path relative to repo root. If None, returns entire repo tree. If specified, returns tree starting from that subdirectory.

Returns:

  • List[Dict[str, Any]]: List of file/directory objects with path, is_dir, name, and size fields.

Examples:

from kit import Repository
repo = Repository("/path/to/repo")
# Get full repository tree
full_tree = repo.get_file_tree()
# Get tree for specific subdirectory
src_tree = repo.get_file_tree(subpath="src")
components_tree = repo.get_file_tree(subpath="src/components")

grep(pattern, *, case_sensitive=True, include_pattern=None, exclude_pattern=None, max_results=1000)

Performs a literal grep search on repository files using the system grep command.

Parameters:

  • pattern (str): The literal string to search for (not a regex)
  • case_sensitive (bool): Whether the search should be case sensitive. Defaults to True
  • include_pattern (str, optional): Glob pattern for files to include (e.g. "*.py")
  • exclude_pattern (str, optional): Glob pattern for files to exclude
  • max_results (int): Maximum number of results to return. Defaults to 1000

Returns:

  • List[Dict[str, Any]]: List of matches with file, line_number, and line_content fields

Examples:

from kit import Repository
repo = Repository("/path/to/repo")
# Basic search
matches = repo.grep("TODO")
# Case insensitive search
matches = repo.grep("function", case_sensitive=False)
# Search only Python files
matches = repo.grep("class", include_pattern="*.py")
# Exclude test files
matches = repo.grep("import", exclude_pattern="*test*")
# Limit results
matches = repo.grep("error", max_results=50)

Note: Requires system grep command to be available in PATH.

extract_symbols(file_path=None)

Extracts symbols (functions, classes, etc.) from repository files.

Parameters:

  • file_path (str, optional): Specific file to extract from. If None, extracts from all files.

Returns:

  • List[Dict[str, Any]]: List of symbol objects with name, type, start_line, end_line, and code fields.
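
Examples:

A minimal sketch; the file path is illustrative and the field names follow the return description above.

from kit import Repository
repo = Repository("/path/to/repo")
# Symbols from the whole repository
all_symbols = repo.extract_symbols()
# Symbols from a single file
file_symbols = repo.extract_symbols("src/main.py")
for symbol in file_symbols:
    print(symbol["name"], symbol["type"], symbol["start_line"])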

search_text(query, file_pattern="*")

Searches for text patterns in repository files using Kit’s built-in search.

Parameters:

  • query (str): Text or regex pattern to search for
  • file_pattern (str): Glob pattern for files to search. Defaults to "*"

Returns:

  • List[Dict[str, Any]]: List of search results
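
Examples:

A brief sketch; the regex and glob are illustrative.

from kit import Repository
repo = Repository("/path/to/repo")
# Regex search limited to Python files
results = repo.search_text(r"def \w+_handler", file_pattern="*.py")
for result in results:
    print(result["file"], result["line_number"])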

get_file_content(file_path)

Reads and returns file content(s).

Parameters:

  • file_path (str | List[str]): Single file path or list of file paths

Returns:

  • str | Dict[str, str]: File content (single) or mapping of path → content (multiple)

Properties

current_sha

Current git commit SHA (full).

current_sha_short

Current git commit SHA (short).

current_branch

Current git branch name.

remote_url

Git remote URL.

is_dirty

Whether the working directory has uncommitted changes.
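
Examples:

A quick sketch that reads the git metadata properties listed above.

from kit import Repository
repo = Repository("/path/to/repo")
print(f"Commit {repo.current_sha_short} on branch {repo.current_branch}")
print(f"Remote: {repo.remote_url}")
if repo.is_dirty:
    print("Working directory has uncommitted changes")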

Utility Methods

get_abs_path(relative_path)

Converts relative path to absolute path within repository.

index()

Returns comprehensive repository index with file tree and symbols.

write_symbols(file_path, symbols=None)

Writes symbols to JSON file.

write_file_tree(file_path)

Writes file tree to JSON file.

find_symbol_usages(symbol_name, symbol_type=None)

Finds all usages of a symbol across the repository.
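
Examples:

A short sketch; the file paths and output file names are illustrative.

from kit import Repository
repo = Repository("/path/to/repo")
# Absolute path of a file inside the repository
abs_path = repo.get_abs_path("src/main.py")
# Build the full index (file tree plus symbols)
repo_index = repo.index()
# Persist symbols and the file tree as JSON
repo.write_symbols("symbols.json")
repo.write_file_tree("file_tree.json")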

Examples

Basic Repository Analysis

from kit import Repository
# Open repository
repo = Repository("/path/to/project")
# Get repository structure
tree = repo.get_file_tree()
print(f"Found {len(tree)} files and directories")
# Extract all symbols
symbols = repo.extract_symbols()
print(f"Found {len(symbols)} symbols")
# Search for TODOs
todos = repo.grep("TODO", case_sensitive=False)
print(f"Found {len(todos)} TODO items")

Working with Subdirectories

# Analyze only source code directory
src_tree = repo.get_file_tree(subpath="src")
src_symbols = repo.extract_symbols("src/main.py")
# Search within specific directory
api_todos = repo.grep("TODO", include_pattern="src/api/*.py")

File Content Operations

# Single file
content = repo.get_file_content("README.md")
# Multiple files
contents = repo.get_file_content(["src/main.py", "src/utils.py"])
for file_path, content in contents.items():
    print(f"{file_path}: {len(content)} characters")

Initialization

from kit import Repository
# Initialize with a local path
local_repo = Repository("/path/to/your/local/project")
# Initialize with a remote URL (requires git)
# remote_repo = Repository("https://github.com/user/repo.git")
# Initialize at a specific commit, tag, or branch
# versioned_repo = Repository("https://github.com/user/repo.git", ref="v1.2.3")

Incremental Analysis Methods

  • extract_symbols_incremental(): High-performance symbol extraction with caching.
  • get_incremental_stats(): Get cache performance statistics.
  • cleanup_incremental_cache(): Remove stale cache entries.
  • clear_incremental_cache(): Clear all cached data.

Creating a Repository Instance

To start using kit, first create an instance of the Repository class. This points kit to the codebase you want to analyze.

from kit import Repository
# For a local directory
repository_instance = Repository(path_or_url="/path/to/local/project")
# For a remote Git repository (public or private)
# repository_instance = Repository(
#     path_or_url="https://github.com/owner/repo-name",
#     github_token="YOUR_GITHUB_TOKEN",  # Optional: for private repos
#     cache_dir="/path/to/cache",        # Optional: for caching clones
#     ref="v1.2.3",                      # Optional: specific commit, tag, or branch
#     cache_ttl_hours=2,                 # Optional: new in v1.2.0
# )

Parameters:

  • path_or_url (str): The path to a local directory or the URL of a remote Git repository.
  • github_token (Optional[str]): A GitHub personal access token required for cloning private repositories. If not provided, automatically checks the KIT_GITHUB_TOKEN and GITHUB_TOKEN environment variables. Defaults to None.
  • cache_dir (Optional[str]): Path to a directory for caching cloned repositories. Defaults to a system temporary directory.
  • ref (Optional[str]): Git reference (commit SHA, tag, or branch) to checkout. For remote repositories, this determines which version to clone. For local repositories, this will checkout the specified ref. Defaults to None.
  • cache_ttl_hours (Optional[float]): New in v1.2.0. When cloning a remote repository, Kit stores the clone under /tmp/kit-repo-cache (or your custom cache_dir). On the first clone in each Python process, Kit deletes any cached repo directory whose modification time is older than this many hours before continuing. Pass None (the default) or 0 to disable the cleanup. You can also set the environment variable KIT_TMP_REPO_TTL_HOURS to apply the policy process-wide without changing code.

Once you have a repository object, you can call the following methods on it:

repository.get_file_tree()

Returns the file tree structure of the repository.

repository.get_file_tree() -> List[Dict[str, Any]]

Returns:

  • List[Dict[str, Any]]: A list of dictionaries, where each dictionary represents a file or directory with keys like path, name, is_dir, size.

repository.get_file_content()

Reads and returns the content of a specified file within the repository as a string.

repository.get_file_content(file_path: str) -> str

Parameters:

  • file_path (str): The path to the file, relative to the repository root.

Returns:

  • str: The content of the file.

Raises:

  • FileNotFoundError: If the file does not exist at the specified path.
  • IOError: If any other I/O error occurs during file reading.
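
Example (a minimal sketch showing the documented exceptions):

try:
    readme = repository.get_file_content("README.md")
    print(f"README.md is {len(readme)} characters long")
except FileNotFoundError:
    print("README.md does not exist in this repository")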

repository.extract_symbols()

Extracts code symbols (functions, classes, variables, etc.) from the repository.

repository.extract_symbols(file_path: Optional[str] = None) -> List[Dict[str, Any]]

Parameters:

  • file_path (Optional[str]): If provided, extracts symbols only from this specific file path relative to the repo root. If None, extracts symbols from all supported files in the repository. Defaults to None.

Returns:

  • List[Dict[str, Any]]: A list of dictionaries, each representing a symbol with keys like name, type, file, line_start, line_end, code.

repository.extract_symbols_incremental()

Extracts code symbols with intelligent caching for dramatically improved performance. Uses multiple invalidation strategies (mtime, size, content hash, git state) to ensure accuracy while providing 25-36x speedups for warm cache scenarios.

repository.extract_symbols_incremental(file_path: Optional[str] = None) -> List[Dict[str, Any]]

Parameters:

  • file_path (Optional[str]): If provided, extracts symbols only from this specific file path relative to the repo root. If None, extracts symbols from all supported files in the repository. Defaults to None.

Returns:

  • List[Dict[str, Any]]: A list of dictionaries, each representing a symbol with keys like name, type, file, line_start, line_end, code.

Performance:

  • Cold cache: Full analysis with cache building
  • Warm cache: 25x faster using cached results
  • Automatic invalidation: Cache invalidated on git state changes, file modifications

Example:

# First call builds cache
symbols = repository.extract_symbols_incremental()
print(f"Found {len(symbols)} symbols")
# Subsequent calls use cache (much faster)
symbols = repository.extract_symbols_incremental()
print(f"Found {len(symbols)} symbols (cached)")
# Check performance
stats = repository.get_incremental_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']}")

repository.get_incremental_stats()

Returns performance statistics for the incremental analysis system.

repository.get_incremental_stats() -> Dict[str, Any]

Returns:

  • Dict[str, Any]: Statistics dictionary containing:
    • cache_hit_rate: Percentage of cache hits (0.0-1.0)
    • files_analyzed: Number of files analyzed in last operation
    • cache_hits: Total cache hits
    • cache_misses: Total cache misses
    • avg_analysis_time: Average time per file analysis
    • cache_size_mb: Cache size in megabytes

Example:

stats = repository.get_incremental_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']:.1%}")
print(f"Files analyzed: {stats['files_analyzed']}")
print(f"Cache size: {stats['cache_size_mb']:.1f}MB")

repository.cleanup_incremental_cache()

Removes stale cache entries for files that no longer exist in the repository.

repository.cleanup_incremental_cache() -> int

Returns:

  • int: Number of stale entries removed.

Example:

removed_count = repository.cleanup_incremental_cache()
print(f"Removed {removed_count} stale cache entries")

repository.clear_incremental_cache()

Clears all cached data for the incremental analysis system.

repository.clear_incremental_cache() -> None

Example:

repository.clear_incremental_cache()
print("Cache cleared")

repository.search_text()

Searches for literal text or regex patterns within files.

repository.search_text(query: str, file_pattern: str = "*.py") -> List[Dict[str, Any]]

Parameters:

  • query (str): The text or regex pattern to search for.
  • file_pattern (str): A glob pattern to filter files to search within. Defaults to "*.py".

Returns:

  • List[Dict[str, Any]]: A list of dictionaries representing search matches, with keys like file, line_number, line_content.

repository.chunk_file_by_lines()

Chunks a file’s content based on line count.

repository.chunk_file_by_lines(file_path: str, max_lines: int = 50) -> List[str]

Parameters:

  • file_path (str): The path to the file (relative to repo root) to chunk.
  • max_lines (int): The maximum number of lines per chunk. Defaults to 50.

Returns:

  • List[str]: A list of strings, where each string is a chunk of the file content.
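
Example (a short sketch; the file path and chunk size are illustrative):

chunks = repository.chunk_file_by_lines("src/main.py", max_lines=100)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk.splitlines())} lines")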

repository.chunk_file_by_symbols()

Chunks a file’s content based on its top-level symbols (functions, classes).

repository.chunk_file_by_symbols(file_path: str) -> List[Dict[str, Any]]

Parameters:

  • file_path (str): The path to the file (relative to repo root) to chunk.

Returns:

  • List[Dict[str, Any]]: A list of dictionaries, each representing a symbol chunk with keys like name, type, code.
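
Example (a short sketch; the file path is illustrative):

symbol_chunks = repository.chunk_file_by_symbols("src/main.py")
for chunk in symbol_chunks:
    print(f"{chunk['type']} {chunk['name']}: {len(chunk['code'])} characters")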

repository.extract_context_around_line()

Extracts the surrounding code context (typically the containing function or class) for a specific line number.

repository.extract_context_around_line(file_path: str, line: int) -> Optional[Dict[str, Any]]

Parameters:

  • file_path (str): The path to the file (relative to repo root).
  • line (int): The (0-based) line number to find context for.

Returns:

  • Optional[Dict[str, Any]]: A dictionary representing the symbol context (with keys like name, type, code), or None if no context is found.
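
Example (a minimal sketch; remember the line number is 0-based):

context = repository.extract_context_around_line("src/main.py", line=120)
if context is not None:
    print(f"Enclosing {context['type']}: {context['name']}")
else:
    print("No enclosing function or class found")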

repository.index()

Builds and returns a comprehensive index of the repository, including both the file tree and all extracted symbols.

repository.index() -> Dict[str, Any]

Returns:

  • Dict[str, Any]: A dictionary containing the full index, typically with keys like file_tree and symbols.
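
Example (a brief sketch using the documented top-level keys):

repo_index = repository.index()
print(repo_index.keys())  # typically: file_tree, symbols
print(f"{len(repo_index['file_tree'])} file-tree entries")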

repository.get_vector_searcher()

Initializes and returns the VectorSearcher instance for performing semantic search.

repository.get_vector_searcher(embed_fn=None, backend=None, persist_dir=None) -> VectorSearcher

Parameters:

  • embed_fn (Callable): Required on first call. A function that takes a list of strings and returns a list of embedding vectors.
  • backend (Optional[Any]): Specifies the vector database backend. If None, kit defaults to using ChromaDBBackend.
  • persist_dir (Optional[str]): Path to a directory in which to persist the vector index. If None, the VectorSearcher defaults to YOUR_REPO_PATH/.kit/vector_db/ for ChromaDB.

Returns:

  • VectorSearcher: An instance of the vector searcher configured for this repository.

(See Configuring Semantic Search for more details.)
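
Example (a sketch only; the sentence-transformers model is one possible embedding function, not something Kit requires):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def embed_fn(texts):
    # Must map a list of strings to a list of embedding vectors
    return model.encode(texts).tolist()

# embed_fn is required on the first call; persist_dir is optional
vector_searcher = repository.get_vector_searcher(embed_fn=embed_fn)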

repository.search_semantic()

Performs a semantic search query over the indexed codebase.

repository.search_semantic(query: str, top_k: int = 5, embed_fn=None) -> List[Dict[str, Any]]

Parameters:

  • query (str): The natural language query to search for.
  • top_k (int): The maximum number of results to return. Defaults to 5.
  • embed_fn (Callable): Required if the vector searcher hasn’t been initialized yet.

Returns:

  • List[Dict[str, Any]]: A list of dictionaries representing the search results, typically including matched code snippets and relevance scores.
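
Example (a sketch reusing the embed_fn from the get_vector_searcher example above; the query is illustrative):

results = repository.search_semantic(
    "where is request retry logic implemented?",
    top_k=5,
    embed_fn=embed_fn,  # only needed if the vector searcher has not been initialized yet
)
for hit in results:
    print(hit)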

repository.find_symbol_usages()

Finds definitions and references of a specific symbol across the repository.

repository.find_symbol_usages(symbol_name: str, symbol_type: Optional[str] = None) -> List[Dict[str, Any]]

Parameters:

  • symbol_name (str): The name of the symbol to find usages for.
  • symbol_type (Optional[str]): Optionally restrict the search to a specific symbol type (e.g., 'function', 'class'). Defaults to None (search all types).

Returns:

  • List[Dict[str, Any]]: A list of dictionaries representing symbol usages, including file, line number, and context/snippet.
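
Example (a minimal sketch; the symbol name is illustrative):

usages = repository.find_symbol_usages("UserService", symbol_type="class")
print(f"Found {len(usages)} definitions/references of UserService")
for usage in usages:
    print(usage)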

repository.write_index()

Writes the full repository index (file tree and symbols) to a JSON file.

repository.write_index(file_path: str) -> None

Parameters:

  • file_path (str): The path to the output JSON file.
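
Example (a one-line sketch; the output path is illustrative):

repository.write_index("repo_index.json")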

repository.write_symbol_usages()

Writes the found usages of a specific symbol to a JSON file.

repository.write_symbol_usages(symbol_name: str, file_path: str, symbol_type: Optional[str] = None) -> None

Parameters:

  • symbol_name (str): The name of the symbol whose usages were found.
  • file_path (str): The path to the output JSON file.
  • symbol_type (Optional[str]): The symbol type filter used when finding usages. Defaults to None.
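
Example (a short sketch; the symbol name and output path are illustrative):

repository.write_symbol_usages(
    symbol_name="UserService",
    file_path="user_service_usages.json",
    symbol_type="class",
)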

repository.get_context_assembler()

Convenience helper that returns a fresh ContextAssembler bound to this repository. Use it instead of importing the class directly:

assembler = repository.get_context_assembler()
assembler.add_diff(my_diff)
context_blob = assembler.format_context()

Returns:

  • ContextAssembler: Ready-to-use assembler instance.

repository.get_summarizer()

Automatic cleanup of the tmp repo cache

When you work with remote URLs, Kit keeps a shallow clone on disk for speed (default location: /tmp/kit-repo-cache). Over time those clones can add up, especially in containerised environments. With cache_ttl_hours (or the KIT_TMP_REPO_TTL_HOURS env var) you can make Kit purge old clones automatically:

from kit import Repository

# Purge anything older than 2 hours, then clone
repo = Repository(
    "https://github.com/owner/repo",
    cache_ttl_hours=2,
)

When does the purge run?

  • The very first time _clone_github_repo is executed in a Python process (i.e., when the first remote repo is opened).
  • A light-weight helper walks the top-level folders inside the cache directory and removes those whose directory mtime is older than the TTL.
  • The helper is wrapped in functools.lru_cache(maxsize=1), so subsequent clones in the same process don’t repeat the walk.

If you never set a TTL, Kit keeps the previous behavior: clones live until you or the OS removes them.