# Tool Evals

A framework for evaluating and benchmarking AI assistant performance in the Zed editor.

## Overview

Tool Evals provides a headless environment for running assistant evaluations on code repositories. It automates the process of:

1. Cloning and setting up test repositories
2. Sending prompts to language models
3. Allowing the assistant to use tools to modify code
4. Collecting metrics on performance
5. Evaluating results against known good solutions

## How It Works

The system consists of several key components:

- **Eval**: Loads test cases from the `evaluation_data` directory, clones repos, and executes evaluations
- **HeadlessAssistant**: Provides a headless environment for running the AI assistant
- **Judge**: Compares AI-generated diffs with reference solutions and scores their functional similarity

The evaluation flow:

1. An evaluation is loaded from the `evaluation_data` directory
2. The target repository is cloned and checked out at a specific commit
3. A HeadlessAssistant instance is created with the specified language model
4. The user prompt is sent to the assistant
5. The assistant responds and uses tools to modify code
6. Upon completion, a diff is generated from the changes
7. Results are saved, including the diff, the assistant's response, and performance metrics
8. If a reference solution exists, a Judge scores how closely the generated solution matches it

## Setup Requirements

### Prerequisites

- Rust and Cargo
- Git
- Network access to clone repositories
- API keys for the language model providers and Git services you plan to use (Anthropic, GitHub, etc.)

### Environment Variables

Ensure you have the required API keys set, either from a dev run of Zed or via these environment variables (see the example below):

- `ZED_ANTHROPIC_API_KEY` for Claude models
- `ZED_OPENAI_API_KEY` for OpenAI models
- `ZED_GITHUB_API_KEY` for the GitHub API (or similar)
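
For example, if you are not picking up keys from a dev run of Zed, you can export them in the shell where you run the evaluations (the values below are placeholders):

```bash
# Placeholder values; substitute your own keys and tokens.
export ZED_ANTHROPIC_API_KEY="your-anthropic-key"
export ZED_OPENAI_API_KEY="your-openai-key"
export ZED_GITHUB_API_KEY="your-github-token"
```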

## Usage

### Running a Single Evaluation

To run a specific evaluation:

```bash
cargo run -p assistant_eval -- bubbletea-add-set-window-title
```

The arguments are regex patterns matched against evaluation names, so to run every evaluation whose name contains `bubbletea`, run:

```bash
cargo run -p assistant_eval -- bubbletea
```

To run all evaluations:

```bash
cargo run -p assistant_eval -- --all
```

## Evaluation Data Structure

Each evaluation should be placed in the `evaluation_data` directory with the following structure (an example layout is shown below):

* `prompt.txt`: The user's prompt.
* `original.diff`: The `git diff` of the change anticipated for this prompt.
* `setup.json`: Information about the repo used for the evaluation.
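
For example, the `bubbletea-add-set-window-title` evaluation used in the Usage section above would be laid out as:

```
evaluation_data/
└── bubbletea-add-set-window-title/
    ├── prompt.txt
    ├── original.diff
    └── setup.json
```

The `original.diff` file is an ordinary Git diff. Assuming the anticipated change exists as a commit or branch in the target repo, one way to produce it is (commit names are placeholders):

```bash
# Diff the evaluation's base commit against the commit containing the
# anticipated solution, and save it as the reference diff.
git diff <base-commit> <solution-commit> > original.diff
```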