# Tool Evals

A framework for evaluating and benchmarking agent panel generations.

## Overview

Tool Evals provides a headless environment for running assistant evaluations on code repositories. It automates the process of:

1. Setting up test code and repositories
2. Sending prompts to language models
3. Allowing the assistant to use tools to modify code
4. Collecting metrics on performance and tool usage
5. Evaluating results against known good solutions

## How It Works

The system consists of several key components:

- **Eval**: Loads exercises from the zed-ace-framework repository, creates temporary repos, and executes evaluations
- **HeadlessAssistant**: Provides a headless environment for running the AI assistant
- **Judge**: Evaluates AI-generated solutions against reference implementations and assigns scores
- **Templates**: Defines evaluation frameworks for different tasks (Project Creation, Code Modification, Conversational Guidance)

## Setup Requirements

### Prerequisites

- Rust and Cargo
- Git
- Python (for report generation)
- Network access to clone repositories
- Appropriate API keys for language models and git services (Anthropic, GitHub, etc.)

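As a quick sanity check, you can confirm the core toolchain is available before running anything (this assumes `python3` is the Python binary on your system):

```bash
# Verify the prerequisite tools are installed and on PATH
rustc --version && cargo --version
git --version
python3 --version
```
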
### Environment Variables

Ensure you have the required API keys set, either from a dev run of Zed or via these environment variables:

- `ZED_ANTHROPIC_API_KEY` for Claude models
- `ZED_GITHUB_API_KEY` for the GitHub API (or similar)

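For example, you can export them in your shell before running the evals (the values below are placeholders, not real keys):

```bash
# Placeholder values; substitute your actual keys
export ZED_ANTHROPIC_API_KEY="sk-ant-..."
export ZED_GITHUB_API_KEY="ghp_..."
```
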
## Usage

### Running Evaluations

```bash
# Run all tests
cargo run -p assistant_eval -- --all

# Run only specific languages
cargo run -p assistant_eval -- --all --languages python,rust

# Limit concurrent evaluations
cargo run -p assistant_eval -- --all --concurrency 5

# Limit number of exercises per language
cargo run -p assistant_eval -- --all --max-exercises-per-language 3
```

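Assuming the flags compose as usual, they can also be combined; for example, a quick smoke-test run limited to Rust might look like this:

```bash
# Small, fast run: Rust only, at most 2 exercises, 4 evals in parallel
cargo run -p assistant_eval -- --all --languages rust --concurrency 4 --max-exercises-per-language 2
```
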
### Evaluation Template Types

The system supports three types of evaluation templates:

1. **ProjectCreation**: Tests the model's ability to create new implementations from scratch
2. **CodeModification**: Tests the model's ability to modify existing code to meet new requirements
3. **ConversationalGuidance**: Tests the model's ability to provide guidance without writing code

### Support Repo

The [zed-industries/zed-ace-framework](https://github.com/zed-industries/zed-ace-framework) repository contains the analytics and reporting scripts.