# Shelley Agent Testing Guide
This document provides instructions for automated testing of the Shelley coding agent product.
## Prerequisites

- `ANTHROPIC_API_KEY` environment variable set
- Node.js and pnpm installed
- Go installed
- `headless` browser tool available (check with `which headless`)
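A quick preflight sketch for checking the prerequisites above (the command and variable names come from this guide; the helper function names are mine):

```bash
# Preflight check: report which prerequisites are present (plain POSIX sh).
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

require_env() {
  # Indirect expansion via eval to stay POSIX-compatible.
  eval "val=\${$1:-}"
  if [ -n "$val" ]; then echo "ok: \$$1"; else echo "missing: \$$1"; fi
}

for c in node pnpm go headless; do require_cmd "$c"; done
require_env ANTHROPIC_API_KEY
```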
## Setup Instructions

### 1. Build Shelley

```bash
cd /path/to/shelley
make build
```

This will:

- Build the UI (`pnpm install && pnpm run build`)
- Create template tarballs
- Build the Go binary to `bin/shelley`
### 2. Install Playwright for E2E Tests

```bash
cd ui
pnpm install
pnpm exec playwright install chromium
```
### 3. Start Shelley Server

For testing with Claude:

```bash
./bin/shelley --model claude-sonnet-4.5 --db test.db serve --port 9001
```

For testing with the predictable model (no API key needed):

```bash
./bin/shelley --model predictable --db test.db serve --port 9001
```
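Before running tests, it helps to wait until the server actually answers. A small poll loop (a sketch; it uses the `/api/conversations` endpoint listed later in this guide as a readiness probe):

```bash
# Poll the server until it responds or the retry budget runs out.
wait_for_server() {
  port="${1:-9001}"; tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "http://localhost:${port}/api/conversations" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "not ready"
  return 1
}

# Usage: wait_for_server 9001 30
```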
### 4. Start Headless Browser (if using headless tool)

```bash
headless start
```
## Test Categories

### CLI Tests

Test these commands manually:

```bash
# List available models
./bin/shelley models
```
### E2E Tests (Automated)

Run the full E2E test suite:

```bash
cd ui
pnpm run test:e2e
```

Run specific test files:

```bash
pnpm run test:e2e -- --grep "smoke"
pnpm run test:e2e -- --grep "conversation"
pnpm run test:e2e -- --grep "cancellation"
```
### Headless Browser Testing

```bash
# Navigate to Shelley
headless navigate http://localhost:9001

# Check page title
headless eval 'document.title'

# Get page content
headless eval 'document.body.innerText.slice(0, 2000)'

# Take screenshot
headless screenshot screenshot.png

# Set input value (React-compatible method)
headless eval '(() => {
  const input = document.querySelector("[data-testid=\"message-input\"]");
  const setter = Object.getOwnPropertyDescriptor(HTMLTextAreaElement.prototype, "value").set;
  setter.call(input, "Your message here");
  input.dispatchEvent(new Event("input", { bubbles: true }));
  return "done";
})()'

# Click send button
headless eval 'document.querySelector("[data-testid=\"send-button\"]").click()'

# Check if agent is thinking
headless eval 'document.querySelector("[data-testid=\"agent-thinking\"]")?.innerText || "not thinking"'

# Check for errors
headless eval 'document.querySelector("[role=\"alert\"]")?.innerText || "no errors"'
```
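The React-compatible fill above hard-codes the message text. A small wrapper (a sketch; `fill_js` is my name, the selector and setter trick come from the block above) generates the same script for any message, so you don't have to re-do the quoting by hand:

```bash
# Build the React-compatible input-fill script for an arbitrary message.
fill_js() {
  printf '(() => {
  const input = document.querySelector("[data-testid=\\"message-input\\"]");
  const setter = Object.getOwnPropertyDescriptor(HTMLTextAreaElement.prototype, "value").set;
  setter.call(input, "%s");
  input.dispatchEvent(new Event("input", { bubbles: true }));
  return "done";
})()' "$1"
}

# Only invoke the headless tool if it is actually installed.
{ command -v headless >/dev/null 2>&1 && headless eval "$(fill_js 'Your message here')"; } || true
```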
## Test Checklist

### Things That Work Well (Regression Tests)

- **Page loads correctly** - Title is "Shelley", message input visible
- **Send button state** - Disabled when empty, enabled when text entered
- **Claude integration** - Messages send and receive responses (~2-3 seconds)
- **Prompt caching** - Check server logs for `cache_read_input_tokens`
- **Tool execution - bash** - Ask to run `echo hello`, verify tool output
- **Tool execution - think** - Send `think: analyzing...`, verify the think tool appears
- **Tool execution - patch** - Send `patch: test.txt`, verify the patch tool appears
- **Conversation persistence** - Multiple messages in the same conversation work
- **Enter key sends** - Press Enter in the textarea to send a message
- **Model selector** - Shows available models in the UI
- **Working directory** - Shows the current directory path
- **Accessibility labels** - Input has `aria-label="Message input"`, button has `aria-label="Send message"`
### Known Issues (Need Fixing/Re-checking)

- **Empty message bug (CRITICAL)** - Rapid sequential messages cause 400 errors
  - Test: Send 5+ messages quickly in succession
  - Expected: All should succeed
  - Actual: API returns `messages.N: all messages must have non-empty content`
- **Cancellation state after reload** - Cancelled operations don't show "cancelled" text
  - Test: Start `bash: sleep 100`, cancel it, reload the page
  - Expected: Should show "cancelled" or "[Operation cancelled]"
  - Actual: Shows the tool with an `x` but no cancelled text
- **Thinking indicator stuck on error** - Indicator doesn't hide when the LLM fails
  - Test: Trigger an LLM error (e.g., via rapid messages)
  - Expected: Indicator should hide, error should display
  - Actual: "Agent working..." stays visible indefinitely
- **Menu button outside viewport** - Hamburger menu not clickable on mobile
  - Test: On a mobile viewport, try clicking the menu button
  - Expected: Menu should open
  - Actual: Button reported as "outside of the viewport"
- **Programmatic input filling** - Direct `.value` assignment doesn't enable the send button
  - Test: Use browser automation to set the input value
  - Expected: Send button should enable
  - Actual: Button stays disabled (use the native setter method instead)
## Screenshots to Capture

When testing, capture these screenshots for the report:

- `initial-load.png` - Fresh page load
- `message-typed.png` - Message in input field
- `agent-thinking.png` - Thinking indicator visible
- `response-received.png` - After Claude responds
- `tool-execution.png` - After a tool (bash/think/patch) runs
- `error-state.png` - If any errors occur
- `menu-open.png` - Sidebar/conversation list open
## Report Template

Create `test-report/SHELLEY_TEST_REPORT.md` with:
- **Executive Summary** - Overall pass/fail, key issues
- **Test Environment** - Platform, models tested, browser
- **Test Results Summary** - Table of categories and pass/fail counts
- **Issues Found** - Detailed description of each issue with:
  - File/location
  - Description
  - Expected vs Actual
  - Screenshot
  - Impact
- **What's Working Well** - Positive findings
- **Recommendations** - Prioritized fixes (Critical/High/Medium/Low)
- **Screenshots Index** - List of captured screenshots
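The template above can be scaffolded in one step (a sketch; the headings are one reasonable rendering of the section list):

```bash
# Write an empty report skeleton with the sections listed above.
mkdir -p test-report
cat > test-report/SHELLEY_TEST_REPORT.md <<'EOF'
# Shelley Test Report

## Executive Summary

## Test Environment

## Test Results Summary

## Issues Found

## What's Working Well

## Recommendations

## Screenshots Index
EOF
```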
## Common Issues & Solutions

### Build fails with "no matching files found"

```bash
# Templates need to be built first
make templates
# Then build
make build
```
### Playwright not finding chromium

```bash
cd ui
pnpm exec playwright install chromium
```
### Server already running

```bash
# Find and kill existing process
lsof -i :9001 | grep LISTEN | awk '{print $2}' | xargs kill
```
### Headless browser already running

```bash
headless stop
headless start
```
## API Endpoints for Manual Testing

```bash
# List conversations
curl http://localhost:9001/api/conversations

# Get specific conversation
curl http://localhost:9001/api/conversation/<id>

# Create new conversation (POST)
curl -X POST http://localhost:9001/api/conversations/new \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-sonnet-4.5","cwd":"/path/to/dir"}'

# Send message (POST)
curl -X POST http://localhost:9001/api/conversation/<id>/chat \
  -H "Content-Type: application/json" \
  -d '{"content":"Hello!"}'

# Stream conversation (SSE)
curl http://localhost:9001/api/conversation/<id>/stream
```
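To chain the create call into the chat call without jq, a tiny extractor helps (a sketch; it assumes the create response carries a flat string `"id"` field, which is an assumption about the response shape):

```bash
# Pull the "id" field out of a JSON response on stdin.
extract_id() {
  sed -n 's/.*"id":"\([^"]*\)".*/\1/p'
}

echo '{"id":"abc123","model":"claude-sonnet-4.5"}' | extract_id  # prints abc123
```

Typical use: `id=$(curl -s -X POST http://localhost:9001/api/conversations/new -H "Content-Type: application/json" -d '{"model":"predictable","cwd":"/tmp"}' | extract_id)`.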
## Server Logs to Watch

When testing, monitor server output for:

- `LLM request completed` - Shows model, duration, token usage, cost
- `cache_creation_input_tokens` / `cache_read_input_tokens` - Prompt caching
- `Generated slug for conversation` - Conversation naming
- `400 Bad Request` or other errors - API failures
- `Agent message` with `end_of_turn=true` - Conversation turns completing
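A filter sketch that narrows a noisy server stream down to the notable lines above (the patterns are taken from this list; pipe the server's combined output through it):

```bash
# Keep only the log lines called out in the checklist above.
watch_logs() {
  grep -E 'LLM request completed|cache_(creation|read)_input_tokens|Generated slug for conversation|400 Bad Request|end_of_turn=true'
}

# Usage: ./bin/shelley --model predictable --db test.db serve --port 9001 2>&1 | watch_logs
```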