# Shelley Agent Testing Guide

This document provides instructions for automated testing of the Shelley coding agent product.

## Prerequisites

- `ANTHROPIC_API_KEY` environment variable set
- Node.js and pnpm installed
- Go installed
- `headless` browser tool available (check with `which headless`)
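
The prerequisites above can be checked in one pass before building. The following is a minimal sketch: `missing_tools` and `check_prereqs` are hypothetical helper names, and the binary names `node`, `pnpm`, `go`, and `headless` are assumed to match what is installed on your machine.

```bash
# Print every command from the argument list that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report missing prerequisites; returns non-zero if a required tool is absent.
# A missing ANTHROPIC_API_KEY only produces a warning, because the
# predictable model does not need it.
check_prereqs() {
  missing=$(missing_tools node pnpm go headless)
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    printf 'warning: ANTHROPIC_API_KEY is not set\n' >&2
  fi
  if [ -n "$missing" ]; then
    printf 'missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
  return 0
}
```

Run `check_prereqs` from the repository root before `make build`; a non-zero exit status means at least one tool still needs to be installed.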

## Setup Instructions

### 1. Build Shelley

```bash
cd /path/to/shelley
make build
```

This will:
- Build the UI (`pnpm install && pnpm run build`)
- Create template tarballs
- Build the Go binary to `bin/shelley`

### 2. Install Playwright for E2E Tests

```bash
cd ui
pnpm install
pnpm exec playwright install chromium
```

### 3. Start Shelley Server

For testing with Claude:
```bash
./bin/shelley --model claude-sonnet-4.5 --db test.db serve --port 9001
```

For testing with the predictable model (no API key needed):
```bash
./bin/shelley --model predictable --db test.db serve --port 9001
```
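
When the server is started from a script, the tests should not begin until it accepts connections. Below is a minimal sketch of a generic retry helper; `wait_until` is a hypothetical name, and the example probe assumes the root URL on port 9001 responds once the server is ready.

```bash
# Retry a command once per second until it succeeds or the timeout
# (in seconds) is reached. Returns 0 on success, 1 on timeout.
wait_until() {
  timeout_s=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Example (not run here): wait up to 30 seconds for the server to answer.
# wait_until 30 curl -fsS -o /dev/null http://localhost:9001/ \
#   || echo "server did not come up" >&2
```

Keeping the probe command as an argument means the same helper can wait on other conditions, such as a file appearing or a port closing.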

### 4. Start Headless Browser (if using the headless tool)

```bash
headless start
```

## Test Categories

### CLI Tests

Test these commands manually:

```bash
# List available models
./bin/shelley models
```

### E2E Tests (Automated)

Run the full E2E test suite:

```bash
cd ui
pnpm run test:e2e
```

Run only the tests matching a pattern:
```bash
pnpm run test:e2e -- --grep "smoke"
pnpm run test:e2e -- --grep "conversation"
pnpm run test:e2e -- --grep "cancellation"
```

### Headless Browser Testing

```bash
# Navigate to Shelley
headless navigate http://localhost:9001

# Check page title
headless eval 'document.title'

# Get page content
headless eval 'document.body.innerText.slice(0, 2000)'

# Take screenshot
headless screenshot screenshot.png

# Set input value (React-compatible method)
headless eval '(() => {
  const input = document.querySelector("[data-testid=\"message-input\"]");
  const setter = Object.getOwnPropertyDescriptor(HTMLTextAreaElement.prototype, "value").set;
  setter.call(input, "Your message here");
  input.dispatchEvent(new Event("input", { bubbles: true }));
  return "done";
})()'

# Click send button
headless eval 'document.querySelector("[data-testid=\"send-button\"]").click()'

# Check if agent is thinking
headless eval 'document.querySelector("[data-testid=\"agent-thinking\"]")?.innerText || "not thinking"'

# Check for errors
headless eval 'document.querySelector("[role=\"alert\"]")?.innerText || "no errors"'
```
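
The thinking check above can be wrapped so a script can wait for the agent to finish before taking a screenshot. This is a sketch under one assumption that the source does not confirm: that `headless eval` prints the value of the expression to stdout. `agent_is_idle` is a hypothetical helper name.

```bash
# Succeeds (exit 0) when the thinking indicator is absent from the page.
# Returns 1 while the indicator is shown and 2 if the headless call fails.
# Assumes `headless eval` prints the expression's value to stdout.
agent_is_idle() {
  result=$(headless eval 'document.querySelector("[data-testid=\"agent-thinking\"]")?.innerText || "not thinking"') || return 2
  case "$result" in
    *"not thinking"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here): poll once per second, then capture the result.
# until agent_is_idle; do sleep 1; done
# headless screenshot response-received.png
```

Note that the example loop has no timeout, so it would spin forever if the indicator gets stuck; add a counter when using it unattended.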

## Test Checklist

### Things That Work Well (Regression Tests)

- [ ] **Page loads correctly** - Title is "Shelley", message input visible
- [ ] **Send button state** - Disabled when empty, enabled when text entered
- [ ] **Claude integration** - Messages send and receive responses (~2-3 seconds)
- [ ] **Prompt caching** - Check server logs for `cache_read_input_tokens`
- [ ] **Tool execution - bash** - Ask to run `echo hello`, verify tool output
- [ ] **Tool execution - think** - Send `think: analyzing...`, verify think tool appears
- [ ] **Tool execution - patch** - Send `patch: test.txt`, verify patch tool appears
- [ ] **Conversation persistence** - Multiple messages in same conversation work
- [ ] **Enter key sends** - Press Enter in textarea to send message
- [ ] **Model selector** - Shows available models in UI
- [ ] **Working directory** - Shows current directory path
- [ ] **Accessibility labels** - Input has `aria-label="Message input"`, button has `aria-label="Send message"`

### Known Issues (Need Fixing/Re-checking)

- [ ] **Empty message bug (CRITICAL)** - Rapid sequential messages cause 400 errors
  - Test: Send 5+ messages quickly in succession
  - Expected: All should succeed
  - Actual: API returns `messages.N: all messages must have non-empty content`

- [ ] **Cancellation state after reload** - Cancelled operations don't show "cancelled" text
  - Test: Start `bash: sleep 100`, cancel it, reload page
  - Expected: Should show "cancelled" or "[Operation cancelled]"
  - Actual: Shows tool with `x` but no cancelled text

- [ ] **Thinking indicator stuck on error** - Indicator doesn't hide when LLM fails
  - Test: Trigger an LLM error (e.g., via rapid messages)
  - Expected: Indicator should hide, error should display
  - Actual: "Agent working..." stays visible indefinitely

- [ ] **Menu button outside viewport** - Hamburger menu not clickable on mobile
  - Test: On mobile viewport, try clicking menu button
  - Expected: Menu should open
  - Actual: Button reported as "outside of the viewport"

- [ ] **Programmatic input filling** - Direct `.value` assignment doesn't enable send button
  - Test: Use browser automation to set input value
  - Expected: Send button should enable
  - Actual: Button stays disabled (need to use native setter method)
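
The rapid-message test for the empty message bug can be scripted against the chat endpoint instead of clicking through the UI. This is a sketch only: it assumes the bug also reproduces through the API and that a failing request is visible as a non-2xx HTTP status on the chat call, neither of which the checklist confirms. `send_burst` and `count_failures` are hypothetical helper names, and `CONV_ID` stands for an existing conversation ID.

```bash
# Read one HTTP status code per line on stdin and print how many are not 2xx.
count_failures() {
  awk 'NF && $1 !~ /^2[0-9][0-9]$/ { n++ } END { print n + 0 }'
}

# Send COUNT chat messages back to back and print each response's status code.
send_burst() {
  conv_id=$1
  count=$2
  i=1
  while [ "$i" -le "$count" ]; do
    curl -s -o /dev/null -w '%{http_code}\n' \
      -X POST "http://localhost:9001/api/conversation/${conv_id}/chat" \
      -H "Content-Type: application/json" \
      -d "{\"content\":\"burst message ${i}\"}"
    i=$((i + 1))
  done
}

# Example (not run here): a result of 0 means every request succeeded.
# send_burst "$CONV_ID" 5 | count_failures
```

If the 400 only shows up in the server logs rather than in the HTTP response, watch the server output while the burst runs.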

## Screenshots to Capture

When testing, capture these screenshots for the report:

1. `initial-load.png` - Fresh page load
2. `message-typed.png` - Message in input field
3. `agent-thinking.png` - Thinking indicator visible
4. `response-received.png` - After Claude responds
5. `tool-execution.png` - After a tool (bash/think/patch) runs
6. `error-state.png` - If any errors occur
7. `menu-open.png` - Sidebar/conversation list open
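
Before writing the report, it can help to confirm that the screenshots listed above were actually captured. A minimal sketch follows; `missing_screenshots` is a hypothetical helper, and the directory holding the screenshots is passed in because the guide does not specify one.

```bash
# Print the expected screenshot files that are absent from the given directory.
# error-state.png is listed too, although it only exists if an error occurred.
missing_screenshots() {
  dir=$1
  for name in initial-load message-typed agent-thinking response-received \
              tool-execution error-state menu-open; do
    [ -f "$dir/$name.png" ] || printf '%s\n' "$name.png"
  done
}
```

For example, `missing_screenshots .` lists whatever is still missing from the current directory and prints nothing when all seven files are present.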

## Report Template

Create `test-report/SHELLEY_TEST_REPORT.md` with:

1. **Executive Summary** - Overall pass/fail, key issues
2. **Test Environment** - Platform, models tested, browser
3. **Test Results Summary** - Table of categories and pass/fail counts
4. **Issues Found** - Detailed description of each issue with:
   - File/location
   - Description
   - Expected vs Actual
   - Screenshot
   - Impact
5. **What's Working Well** - Positive findings
6. **Recommendations** - Prioritized fixes (Critical/High/Medium/Low)
7. **Screenshots Index** - List of captured screenshots
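
The seven sections above can be scaffolded so every report starts from the same outline. This is a sketch: `write_report_skeleton` is a hypothetical helper, and the top-level title and the table columns are assumptions rather than a required format.

```bash
# Write an empty report containing the seven sections to the given path,
# creating the parent directory if needed.
write_report_skeleton() {
  report=$1
  mkdir -p "$(dirname "$report")"
  cat > "$report" <<'EOF'
# Shelley Test Report

## Executive Summary

## Test Environment

## Test Results Summary

| Category | Passed | Failed |
|----------|--------|--------|

## Issues Found

## What's Working Well

## Recommendations

## Screenshots Index
EOF
}
```

Calling `write_report_skeleton test-report/SHELLEY_TEST_REPORT.md` overwrites any existing file at that path, so run it once at the start of a test session.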

## Common Issues & Solutions

### Build fails with "no matching files found"
```bash
# Templates need to be built first
make templates
# Then build
make build
```

### Playwright not finding chromium
```bash
cd ui
pnpm exec playwright install chromium
```

### Server already running
```bash
# Find and kill existing process
lsof -i :9001 | grep LISTEN | awk '{print $2}' | xargs kill
```

### Headless browser already running
```bash
headless stop
headless start
```

## API Endpoints for Manual Testing

```bash
# <id> is a placeholder: replace it (including the angle brackets)
# with a real conversation ID before running these commands.

# List conversations
curl http://localhost:9001/api/conversations

# Get specific conversation
curl http://localhost:9001/api/conversation/<id>

# Create new conversation (POST)
curl -X POST http://localhost:9001/api/conversations/new \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-sonnet-4.5","cwd":"/path/to/dir"}'

# Send message (POST)
curl -X POST http://localhost:9001/api/conversation/<id>/chat \
  -H "Content-Type: application/json" \
  -d '{"content":"Hello!"}'

# Stream conversation (SSE)
curl http://localhost:9001/api/conversation/<id>/stream
```

## Server Logs to Watch

When testing, monitor server output for:

- `LLM request completed` - Shows model, duration, token usage, cost
- `cache_creation_input_tokens` / `cache_read_input_tokens` - Prompt caching
- `Generated slug for conversation` - Conversation naming
- `400 Bad Request` or other errors - API failures
- `Agent message` with `end_of_turn=true` - Conversation turns completing
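
If the server output is saved to a file, the strings above can be counted after a test run instead of watched live. The sketch below assumes the output was captured to a file such as `server.log` (a hypothetical name, for example via `2>&1 | tee server.log` when starting the server); `count_log_matches` is a hypothetical helper and makes no assumption about the log format beyond the literal strings listed.

```bash
# Print the number of lines in a log file that contain a fixed string.
# grep exits non-zero when nothing matches, so `|| true` keeps the
# function usable in scripts that run with `set -e`.
count_log_matches() {
  grep -c -F -- "$1" "$2" || true
}

# Example (not run here):
# count_log_matches 'LLM request completed' server.log
# count_log_matches '400 Bad Request' server.log
```

A non-zero count for `400 Bad Request` after a run is a quick signal to re-check the empty message bug in the Known Issues list.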