Incremental Analysis & Caching
kit’s incremental analysis system provides intelligent caching that improves performance for repeated operations. By caching symbol extraction results and using sophisticated invalidation strategies, kit achieves 25x performance improvements for warm cache scenarios.
Overview
The incremental analysis system consists of two main components:
- FileAnalysisCache: Manages file-level caching with multiple invalidation strategies
- IncrementalAnalyzer: Orchestrates analysis with performance tracking and statistics
Performance Benefits
Real-world performance improvements on the kit repository (60k+ symbols, 7,606 files):
- Cold cache: 27.8s (2,158 symbols/sec)
- Warm cache: 0.76s (78,823 symbols/sec)
- Speedup: 36.5x faster
Efficient Change Detection
Key Insight: Kit only analyzes files that have actually changed, avoiding expensive re-analysis of unchanged files.
When you call extract_symbols_incremental(), Kit performs these steps:
- File Discovery: One-time walk to find all supported files (
.py,.js,.ts, etc.) - Change Detection: For each file, quickly check if it changed using mtime, size, and content hash
- Selective Analysis: Only analyze changed files using tree-sitter
- Cache Retrieval: Return cached results instantly for unchanged files
Real-World Impact
If you have 1,000 files in your repository and only 1 file changes:
Analyzing 1 changed files (skipping 999 cached)This selective approach is the core reason for Kit’s dramatic performance improvements - it avoids redundant work entirely.
Behind the Scenes
The analyze_changed_files() method demonstrates this filtering:
def analyze_changed_files(self, file_paths): results = {} changed_files = []
# Filter to only changed files for file_path in file_paths: if self.cache.is_file_changed(file_path): changed_files.append(file_path) else: # Use cached results instantly cached_symbols = self.cache.get_cached_symbols(file_path) if cached_symbols: results[str(file_path)] = cached_symbols self._stats["cache_hits"] += 1
logger.info(f"Analyzing {len(changed_files)} changed files " f"(skipping {len(file_paths) - len(changed_files)} cached)")
# Only analyze the changed files for file_path in changed_files: symbols = self.analyze_file(file_path) results[str(file_path)] = symbols
return resultsQuick Start
Basic Usage
from kit.repository import Repository
# Create repository instancerepo = Repository("/path/to/your/project")
# First analysis - builds cachesymbols = repo.extract_symbols_incremental()print(f"Found {len(symbols)} symbols")
# Second analysis - uses cache (much faster)symbols = repo.extract_symbols_incremental()print(f"Found {len(symbols)} symbols (cached)")
# Get performance statisticsstats = repo.get_incremental_stats()print(f"Cache hit rate: {stats['cache_hit_rate']}")print(f"Files analyzed: {stats['files_analyzed']}")CLI Usage
The incremental analysis system is also available through the CLI:
# Extract symbols with cachingkit symbols /path/to/repo
# View cache statisticskit cache status /path/to/repo
# Clear cache if neededkit cache clear /path/to/repo
# Cleanup stale entrieskit cache cleanup /path/to/repoCache Invalidation Strategies
The system uses multiple strategies to detect file changes:
1. Modification Time (mtime)
Fast check using file system metadata:
if cached_metadata.get("mtime") != current_metadata.get("mtime"): return True # File changed2. File Size
Quick validation of file size changes:
if cached_metadata.get("size") != current_metadata.get("size"): return True # File changed3. Content Hash
Definitive change detection using SHA-256:
if cached_metadata.get("hash") != current_metadata.get("hash"): return True # File changed4. Git State Detection
Automatic cache invalidation when git state changes:
# Automatically detects branch switches, commits, etc.if self._check_git_state_changed(): self._invalidate_caches()Advanced Features
LRU Eviction
The cache implements Least Recently Used (LRU) eviction to manage memory:
from kit.incremental_analyzer import FileAnalysisCache
# Create cache with custom size limitcache = FileAnalysisCache( repo_path=Path("/path/to/repo"), max_cache_size=5000 # Limit to 5000 files)Batch Analysis
Analyze multiple files efficiently with automatic change filtering:
from pathlib import Path
# Get incremental analyzeranalyzer = repo.incremental_analyzer
# Analyze specific files - only changed ones will be processedfiles = [Path("src/main.py"), Path("src/utils.py"), Path("src/models.py")]results = analyzer.analyze_changed_files(files)
# Results contain symbols for all files, but only changed files were analyzedfor file_path, symbols in results.items(): print(f"{file_path}: {len(symbols)} symbols")The analyze_changed_files() method automatically:
- ✅ Filters input files to only analyze those that changed
- ✅ Returns cached results for unchanged files
- ✅ Logs exactly how many files were skipped vs. analyzed
- ✅ Maintains the same output format regardless of cache hits/misses
Performance Tracking
Monitor analysis performance:
# Get detailed statisticsstats = repo.get_incremental_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']}")print(f"Files analyzed: {stats['files_analyzed']}")print(f"Cache hits: {stats['cache_hits']}")print(f"Cache misses: {stats['cache_misses']}")print(f"Average analysis time: {stats['avg_analysis_time']:.3f}s")print(f"Cache size: {stats['cache_size_mb']:.1f}MB")Cache Management
Cache Status
Check cache health and statistics:
# Get cache statisticscache_stats = repo.incremental_analyzer.cache.get_cache_stats()
print(f"Cached files: {cache_stats['cached_files']}")print(f"Total symbols: {cache_stats['total_symbols']}")print(f"Cache size: {cache_stats['cache_size_mb']:.1f}MB")print(f"Cache directory: {cache_stats['cache_dir']}")Cache Cleanup
Remove stale entries for deleted files:
# Clean up stale cache entriesremoved_count = repo.cleanup_incremental_cache()print(f"Removed {removed_count} stale entries")Cache Clearing
Clear all cached data:
# Clear all cache datarepo.clear_incremental_cache()print("Cache cleared")CLI Commands
Cache Status
# View cache statisticskit cache status /path/to/repoOutput:
Cache Statistics: Cached files: 1,234 Total symbols: 45,678 Cache size: 12.3MB Cache directory: /path/to/repo/.kit/incremental_cache Hit rate: 85.2%Cache Management
# Clear all cache datakit cache clear /path/to/repo
# Clean up stale entrieskit cache cleanup /path/to/repo
# View detailed statisticskit cache stats /path/to/repoIntegration with Git
The incremental analysis system automatically integrates with git operations:
Automatic Invalidation
Cache is automatically invalidated when:
- Switching branches (
git checkout) - Making commits (
git commit) - Merging branches (
git merge) - Rebasing (
git rebase) - Any operation that changes the git SHA
Example Workflow
# Initial analysis on main branchrepo = Repository("/path/to/repo")symbols_main = repo.extract_symbols_incremental()print(f"Main branch: {len(symbols_main)} symbols")
# Switch to feature branch (cache automatically invalidated)# git checkout feature-branch
# Analysis on feature branch (cache rebuilt)symbols_feature = repo.extract_symbols_incremental()print(f"Feature branch: {len(symbols_feature)} symbols")
# Switch back to main (cache invalidated again)# git checkout main
# Analysis back on main (cache rebuilt for main)symbols_main_again = repo.extract_symbols_incremental()print(f"Back to main: {len(symbols_main_again)} symbols")Best Practices
1. Use Incremental Analysis for Development
For development workflows where you’re repeatedly analyzing the same codebase:
# Use incremental analysis for better performancesymbols = repo.extract_symbols_incremental()
# Instead of traditional analysis# symbols = repo.extract_symbols() # Slower2. Monitor Cache Performance
Track cache effectiveness:
def analyze_with_monitoring(repo): start_time = time.time() symbols = repo.extract_symbols_incremental() elapsed = time.time() - start_time
stats = repo.get_incremental_stats() print(f"Analysis: {elapsed:.2f}s, Hit rate: {stats['cache_hit_rate']}")
return symbols3. Cleanup in CI/CD
Clean up cache in CI/CD environments:
# In CI/CD pipelinekit cache cleanup /path/to/repokit symbols /path/to/repo --format json > symbols.json4. Cache Size Management
Monitor and manage cache size:
# Check cache size periodicallystats = repo.get_incremental_stats()if stats['cache_size_mb'] > 100: # 100MB limit repo.cleanup_incremental_cache()Troubleshooting
Cache Not Working
If cache doesn’t seem to be working:
- Check git state: Ensure you’re not switching branches frequently
- Verify file stability: Check if files are being modified
- Monitor statistics: Use
get_incremental_stats()to debug
# Debug cache behaviorstats = repo.get_incremental_stats()print(f"Cache hits: {stats['cache_hits']}")print(f"Cache misses: {stats['cache_misses']}")print(f"Hit rate: {stats['cache_hit_rate']}")Performance Issues
If performance is still slow:
-
Clear and rebuild cache:
repo.clear_incremental_cache()symbols = repo.extract_symbols_incremental() -
Check cache size limits:
# Increase cache size if neededanalyzer = repo.incremental_analyzeranalyzer.cache.max_cache_size = 20000 -
Monitor file changes:
# Check if files are changing unexpectedlyfor file_path in changed_files:if analyzer.cache.is_file_changed(file_path):print(f"File changed: {file_path}")
Cache Corruption
If cache becomes corrupted:
# Reset everythingrepo.clear_incremental_cache()repo.incremental_analyzer.cache.clear_cache()
# Rebuild from scratchsymbols = repo.extract_symbols_incremental()Implementation Details
Cache Storage
Cache data is stored in the repository’s .kit directory:
.kit/├── incremental_cache/│ ├── analysis_metadata.json # File metadata (mtime, size, hash)│ └── symbols_cache.json # Cached symbol dataMemory Management
- LRU eviction prevents unlimited memory growth
- Configurable cache size (default: 10,000 files)
- Automatic cleanup of stale entries
Thread Safety
The cache is designed for single-threaded use. For multi-threaded scenarios:
# Create separate instances per threaddef worker_thread(): repo = Repository("/path/to/repo") # New instance symbols = repo.extract_symbols_incremental()The incremental analysis system makes kit’s symbol extraction dramatically faster for development workflows while maintaining complete accuracy and reliability.