# Shelley Agent Testing Guide

This document provides instructions for automated testing of the Shelley coding agent product.

## Prerequisites

- `ANTHROPIC_API_KEY` environment variable set (needed for Claude models; not needed for the `predictable` model)
- Node.js and pnpm installed
- Go installed
- `headless` browser tool available (check with `which headless`)
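
A small preflight function can check these prerequisites in one step before building. This is a sketch: `check_prereqs` is a hypothetical helper, not part of Shelley. It only checks that `ANTHROPIC_API_KEY` is non-empty and that each tool is on `PATH`. It does not check versions or whether the key is valid.

```bash
# Hypothetical preflight helper (not part of Shelley): report missing prerequisites.
# Usage: check_prereqs [command ...]   (defaults to node, pnpm, go, headless)
# Returns 0 only if nothing is missing.
check_prereqs() {
  missing=0
  # Only checks that the key is set and non-empty, not that it is valid.
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "missing: ANTHROPIC_API_KEY (not needed for the predictable model)"
    missing=$((missing + 1))
  fi
  # Fall back to the tools listed above when no commands are given.
  [ "$#" -gt 0 ] || set -- node pnpm go headless
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Source the snippet and run `check_prereqs`. When testing only with the predictable model, you can ignore the API key line.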

## Setup Instructions

### 1. Build Shelley

```bash
cd /path/to/shelley
make build
```

This will:
- Build the UI (`pnpm install && pnpm run build`)
- Create template tarballs
- Build the Go binary to `bin/shelley`

### 2. Install Playwright for E2E Tests

```bash
cd ui
pnpm install
pnpm exec playwright install chromium
```

### 3. Start Shelley Server

From the repository root, for testing with Claude (requires `ANTHROPIC_API_KEY`):
```bash
./bin/shelley --model claude-sonnet-4.5 --db test.db serve --port 9001
```

For testing with the `predictable` model (no API key needed):
```bash
./bin/shelley --model predictable --db test.db serve --port 9001
```
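
When you script these steps, wait until the server answers before you start the browser or the E2E suite. This is a minimal sketch. It assumes `curl` is installed and treats any HTTP response from the root URL as "up". `wait_for_shelley` is a hypothetical helper, not a Shelley command.

```bash
# Hypothetical helper: poll the Shelley server until it answers HTTP, or give up.
# Usage: wait_for_shelley [url] [max_tries]   (about one try per second)
wait_for_shelley() {
  url=${1:-http://localhost:9001}
  max_tries=${2:-30}
  try=0
  while [ "$try" -lt "$max_tries" ]; do
    # Without -f, curl exits 0 for any HTTP response (even a 404),
    # which is enough to know the server is listening.
    if curl -s -o /dev/null --max-time 2 "$url"; then
      echo "server is up at $url"
      return 0
    fi
    sleep 1
    try=$((try + 1))
  done
  echo "no response from $url after $max_tries tries" >&2
  return 1
}

# Example: start the predictable-model server in the background, then wait for it.
#   ./bin/shelley --model predictable --db test.db serve --port 9001 &
#   wait_for_shelley http://localhost:9001 30
```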

### 4. Start Headless Browser (if using the `headless` tool)

```bash
headless start
```

## Test Categories

### CLI Tests

Test this command manually:

```bash
# List available models
./bin/shelley models
```

### E2E Tests (Automated)

Run the full E2E test suite:

```bash
cd ui
pnpm run test:e2e
```

Run a subset of the suite by pattern (from `ui/`):
```bash
pnpm run test:e2e -- --grep "smoke"
pnpm run test:e2e -- --grep "conversation"
pnpm run test:e2e -- --grep "cancellation"
```

### Headless Browser Testing

```bash
# Navigate to Shelley
headless navigate http://localhost:9001

# Check page title
headless eval 'document.title'

# Get page content
headless eval 'document.body.innerText.slice(0, 2000)'

# Take screenshot
headless screenshot screenshot.png

# Set input value (React-compatible method): React does not register a direct
# .value assignment as a change, so call the native setter, then fire "input"
headless eval '(() => {
  const input = document.querySelector("[data-testid=\"message-input\"]");
  const setter = Object.getOwnPropertyDescriptor(HTMLTextAreaElement.prototype, "value").set;
  setter.call(input, "Your message here");
  input.dispatchEvent(new Event("input", { bubbles: true }));
  return "done";
})()'

# Click send button
headless eval 'document.querySelector("[data-testid=\"send-button\"]").click()'

# Check if agent is thinking
headless eval 'document.querySelector("[data-testid=\"agent-thinking\"]")?.innerText || "not thinking"'

# Check for errors
headless eval 'document.querySelector("[role=\"alert\"]")?.innerText || "no errors"'
```
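
When the same steps repeat across test runs, you can wrap the snippets above in shell functions. This sketch assumes the `headless eval '<js>'` form shown above and that `eval` prints the expression's value, possibly JSON-quoted. The names `js_escape`, `shelley_type`, `shelley_click_send`, `wait_until_idle`, and the `HEADLESS` override variable are all hypothetical. Messages must be single-line, because the helper escapes only backslashes and double quotes.

```bash
# Command used to drive the browser; the override (hypothetical) allows dry runs.
HEADLESS=${HEADLESS:-headless}

# Escape backslashes and double quotes so a message fits in a JS "..." string.
js_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Fill the message box using the React-compatible native setter shown above.
shelley_type() {
  msg=$(js_escape "$1")
  "$HEADLESS" eval "(() => {
  const input = document.querySelector('[data-testid=\"message-input\"]');
  const setter = Object.getOwnPropertyDescriptor(HTMLTextAreaElement.prototype, 'value').set;
  setter.call(input, \"$msg\");
  input.dispatchEvent(new Event('input', { bubbles: true }));
  return 'done';
})()"
}

# Click Send in a separate eval, mirroring the steps above.
shelley_click_send() {
  "$HEADLESS" eval 'document.querySelector("[data-testid=\"send-button\"]").click()'
}

# Poll until the agent-thinking indicator is gone; give up after N tries (~1s apart).
wait_until_idle() {
  tries=${1:-60}
  while [ "$tries" -gt 0 ]; do
    state=$("$HEADLESS" eval 'document.querySelector("[data-testid=\"agent-thinking\"]") ? "thinking" : "idle"')
    case "$state" in
      *idle*) return 0 ;;
    esac
    sleep 1
    tries=$((tries - 1))
  done
  echo "agent still thinking" >&2
  return 1
}

# Example:
#   shelley_type 'Run: echo hello'
#   shelley_click_send
#   wait_until_idle 120
```

The "Thinking indicator stuck on error" issue is listed under Known Issues. Because of it, a timeout in `wait_until_idle` may mean an LLM error rather than a slow response, so also run the `[role="alert"]` check.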

## Test Checklist

### Things That Work Well (Regression Tests)

- [ ] **Page loads correctly** - Title is "Shelley", message input visible
- [ ] **Send button state** - Disabled when empty, enabled when text entered
- [ ] **Claude integration** - Messages send and receive responses (~2-3 seconds)
- [ ] **Prompt caching** - Check server logs for `cache_read_input_tokens`
- [ ] **Tool execution - bash** - Ask to run `echo hello`, verify tool output
- [ ] **Tool execution - think** - Send `think: analyzing...`, verify think tool appears
- [ ] **Tool execution - patch** - Send `patch: test.txt`, verify patch tool appears
- [ ] **Conversation persistence** - Multiple messages in the same conversation work
- [ ] **Enter key sends** - Press Enter in the textarea to send a message
- [ ] **Model selector** - Shows available models in the UI
- [ ] **Working directory** - Shows current directory path
- [ ] **Accessibility labels** - Input has `aria-label="Message input"`, button has `aria-label="Send message"`
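
You can script several of these regression checks with `headless eval`. This sketch reuses the selectors and `aria-label` values listed above. It assumes `headless eval` prints the expression's value, possibly quoted. `ui_check`, `check_ui_basics`, and the `HEADLESS` override are hypothetical. Run it on a fresh page load, because the send-button check expects an empty input.

```bash
HEADLESS=${HEADLESS:-headless}

# Evaluate a JS expression that should return "ok"; print PASS/FAIL for it.
ui_check() {
  name=$1
  result=$("$HEADLESS" eval "$2")
  case "$result" in
    ok|\"ok\") echo "PASS $name" ;;
    *) echo "FAIL $name (got: $result)"; return 1 ;;
  esac
}

# Run the page-load, label, and send-button checks; non-zero exit if any fail.
check_ui_basics() {
  status=0
  ui_check "title is Shelley" \
    'document.title === "Shelley" ? "ok" : document.title' || status=1
  ui_check "message input label" \
    'document.querySelector("[aria-label=\"Message input\"]") ? "ok" : "missing"' || status=1
  ui_check "send button label" \
    'document.querySelector("[aria-label=\"Send message\"]") ? "ok" : "missing"' || status=1
  ui_check "send disabled when empty" \
    '(() => { const b = document.querySelector("[data-testid=\"send-button\"]"); return b && b.disabled ? "ok" : "not disabled"; })()' || status=1
  return "$status"
}
```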

### Known Issues (Need Fixing/Re-checking)

- [ ] **Empty message bug (CRITICAL)** - Rapid sequential messages cause 400 errors
  - Test: Send 5+ messages quickly in succession
  - Expected: All should succeed
  - Actual: API returns `messages.N: all messages must have non-empty content`

- [ ] **Cancellation state after reload** - Cancelled operations don't show "cancelled" text
  - Test: Start `bash: sleep 100`, cancel it, reload the page
  - Expected: Should show "cancelled" or "[Operation cancelled]"
  - Actual: Shows tool with `x` but no cancelled text

- [ ] **Thinking indicator stuck on error** - Indicator doesn't hide when the LLM call fails
  - Test: Trigger an LLM error (e.g., via rapid messages)
  - Expected: Indicator should hide, error should display
  - Actual: "Agent working..." stays visible indefinitely

- [ ] **Menu button outside viewport** - Hamburger menu not clickable on mobile
  - Test: On a mobile viewport, try clicking the menu button
  - Expected: Menu should open
  - Actual: Button reported as "outside of the viewport"

- [ ] **Programmatic input filling** - Direct `.value` assignment doesn't enable the send button
  - Test: Use browser automation to set the input value
  - Expected: Send button should enable
  - Actual: Button stays disabled; the native setter method shown under Headless Browser Testing works around this
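
You can also try to reproduce the empty-message bug without the UI. Fire several chat requests at once against the chat endpoint listed under API Endpoints for Manual Testing. This sketch assumes you have an existing conversation ID and that the endpoint accepts the `{"content": ...}` body used there. `send_burst` and `SHELLEY_URL` are hypothetical. The helper counts only requests that fail at the HTTP level. The chat endpoint may accept every request and surface the upstream `400 Bad Request` only later, so the server logs remain the real signal.

```bash
# Hypothetical repro for the rapid-sequential-message bug.
# Usage: send_burst <conversation-id> [count]
# Prints how many requests failed (HTTP error status or no connection).
send_burst() {
  id=$1
  count=${2:-5}
  base=${SHELLEY_URL:-http://localhost:9001}
  faildir=$(mktemp -d)
  pids=""
  i=1
  while [ "$i" -le "$count" ]; do
    # Background each request so they overlap, like a user sending quickly.
    # -f makes curl exit non-zero on HTTP errors such as 400.
    (
      curl -sf -o /dev/null --max-time 30 -X POST "$base/api/conversation/$id/chat" \
        -H "Content-Type: application/json" \
        -d "{\"content\":\"burst message $i\"}" || touch "$faildir/fail.$i"
    ) &
    pids="$pids $!"
    i=$((i + 1))
  done
  # Wait only for the requests started here (not, e.g., a backgrounded server).
  for pid in $pids; do wait "$pid"; done
  failed=$(find "$faildir" -name 'fail.*' | wc -l | tr -d ' ')
  rm -rf "$faildir"
  echo "$failed"
}

# Example: send_burst <id> 5   (then check the server logs for 400 Bad Request)
```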

## Screenshots to Capture

When testing, capture these screenshots for the report:

1. `initial-load.png` - Fresh page load
2. `message-typed.png` - Message in input field
3. `agent-thinking.png` - Thinking indicator visible
4. `response-received.png` - After Claude responds
5. `tool-execution.png` - After a tool (bash/think/patch) runs
6. `error-state.png` - If any errors occur
7. `menu-open.png` - Sidebar/conversation list open
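
Saving every capture into one directory makes the report's Screenshots Index easy to produce. This sketch assumes the `headless screenshot <file>` form shown under Headless Browser Testing. The `test-report/screenshots/` location, the `capture` and `screenshot_index` helpers, and the `HEADLESS`/`SHOT_DIR` overrides are all hypothetical.

```bash
HEADLESS=${HEADLESS:-headless}
SHOT_DIR=${SHOT_DIR:-test-report/screenshots}

# Save a screenshot under one of the names above, e.g. `capture initial-load`.
capture() {
  mkdir -p "$SHOT_DIR"
  "$HEADLESS" screenshot "$SHOT_DIR/$1.png"
}

# Print a Markdown bullet list of captured screenshots, in filename order.
screenshot_index() {
  for f in "$SHOT_DIR"/*.png; do
    [ -e "$f" ] || continue   # no matches: the glob stays unexpanded
    echo "- \`$(basename "$f")\`"
  done
}

# Example:
#   capture initial-load
#   capture message-typed
#   screenshot_index >> test-report/SHELLEY_TEST_REPORT.md
```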

## Report Template

Create `test-report/SHELLEY_TEST_REPORT.md` with:

1. **Executive Summary** - Overall pass/fail, key issues
2. **Test Environment** - Platform, models tested, browser
3. **Test Results Summary** - Table of categories and pass/fail counts
4. **Issues Found** - Detailed description of each issue with:
   - File/location
   - Description
   - Expected vs Actual
   - Screenshot
   - Impact
5. **What's Working Well** - Positive findings
6. **Recommendations** - Prioritized fixes (Critical/High/Medium/Low)
7. **Screenshots Index** - List of captured screenshots
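
You can generate the skeleton up front and fill it in as testing proceeds. In this sketch, `scaffold_report` is a hypothetical helper. Its issue sub-template simply mirrors the fields listed above, and it refuses to overwrite an existing report.

```bash
# Hypothetical helper: create the report skeleton described above.
# Usage: scaffold_report [path]   (default: test-report/SHELLEY_TEST_REPORT.md)
scaffold_report() {
  report=${1:-test-report/SHELLEY_TEST_REPORT.md}
  if [ -e "$report" ]; then
    echo "refusing to overwrite $report" >&2
    return 1
  fi
  mkdir -p "$(dirname "$report")"
  cat > "$report" <<'EOF'
# Shelley Test Report

## 1. Executive Summary

## 2. Test Environment

## 3. Test Results Summary

| Category | Passed | Failed |
|----------|--------|--------|

## 4. Issues Found

### Issue: <title>
- File/location:
- Description:
- Expected vs Actual:
- Screenshot:
- Impact:

## 5. What's Working Well

## 6. Recommendations

## 7. Screenshots Index
EOF
  echo "created $report"
}
```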

## Common Issues & Solutions

### Build fails with "no matching files found"
```bash
# Templates need to be built first
make templates
# Then build
make build
```

### Playwright not finding chromium
```bash
cd ui
pnpm exec playwright install chromium
```

### Server already running
```bash
# Find the process(es) listening on port 9001 and stop them
pids=$(lsof -t -iTCP:9001 -sTCP:LISTEN)
[ -n "$pids" ] && kill $pids
```

### Headless browser already running
```bash
headless stop
headless start
```

## API Endpoints for Manual Testing

```bash
# List conversations
curl http://localhost:9001/api/conversations

# Get specific conversation
curl http://localhost:9001/api/conversation/<id>

# Create new conversation (POST)
curl -X POST http://localhost:9001/api/conversations/new \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-sonnet-4.5","cwd":"/path/to/dir"}'

# Send message (POST)
curl -X POST http://localhost:9001/api/conversation/<id>/chat \
  -H "Content-Type: application/json" \
  -d '{"content":"Hello!"}'

# Stream conversation (SSE); -N disables curl's output buffering
curl -N http://localhost:9001/api/conversation/<id>/stream
```
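
A hand-written `-d` payload breaks as soon as a message contains a double quote. This sketch wraps the chat and stream endpoints above in small helpers. `chat_body`, `shelley_chat`, `shelley_watch`, and `SHELLEY_URL` are hypothetical. The helpers escape only backslashes and double quotes, so keep messages single-line. `<id>` must come from an existing conversation.

```bash
SHELLEY_URL=${SHELLEY_URL:-http://localhost:9001}

# Build a {"content": ...} body, escaping backslashes and double quotes.
# Newlines and other control characters are not handled.
chat_body() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"content":"%s"}' "$escaped"
}

# Send one message to an existing conversation: shelley_chat <id> <message>
shelley_chat() {
  curl -sf -X POST "$SHELLEY_URL/api/conversation/$1/chat" \
    -H "Content-Type: application/json" \
    -d "$(chat_body "$2")"
}

# Follow the SSE stream for up to N seconds (default 30); -N turns off buffering.
shelley_watch() {
  curl -sN --max-time "${2:-30}" "$SHELLEY_URL/api/conversation/$1/stream"
}

# Example:
#   shelley_chat <id> 'Run: echo "hello"'
#   shelley_watch <id> 60
```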

## Server Logs to Watch

When testing, monitor server output for:

- `LLM request completed` - Shows model, duration, token usage, cost
- `cache_creation_input_tokens` / `cache_read_input_tokens` - Prompt caching
- `Generated slug for conversation` - Conversation naming
- `400 Bad Request` or other errors - API failures
- `Agent message` with `end_of_turn=true` - Conversation turns completing
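
Piping the server output through `tee` keeps a copy for the report, and a small filter can then tally the lines listed above. In this sketch, `server.log` and `scan_shelley_log` are hypothetical. The patterns are only the strings quoted above, so adjust them if the actual log format differs.

```bash
# Keep a copy of everything the server prints (stdout and stderr), e.g.:
#   ./bin/shelley --model claude-sonnet-4.5 --db test.db serve --port 9001 2>&1 | tee server.log

# Hypothetical helper: count matching lines for each pattern in a saved log.
# Exits non-zero if any "400 Bad Request" appears, so a script can flag the run.
scan_shelley_log() {
  log=${1:-server.log}
  for pattern in 'LLM request completed' 'cache_creation_input_tokens' \
                 'cache_read_input_tokens' 'Generated slug for conversation' \
                 '400 Bad Request' 'end_of_turn=true'; do
    # grep -c prints 0 (and exits 1) when nothing matches; ignore that status.
    n=$(grep -cF -- "$pattern" "$log" || true)
    printf '%-32s %s\n' "$pattern" "$n"
  done
  ! grep -qF '400 Bad Request' "$log"
}
```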