---
title: "feat: Add durable live session recovery"
type: feat
status: active
date: 2026-04-28
deepened: 2026-04-28
---

# feat: Add durable live session recovery

## Overview

Impeccable live mode should remain recoverable when the browser changes live-session state while the agent is not actively polling, after the agent is interrupted, or after the helper server/page reloads. The change adds a durable live-session journal, a resumable server/browser state contract, and an agent-facing status command so a fresh agent can reconstruct the active session and continue from the correct next action.

This plan does not change the core source-first live-mode architecture from `docs/adr-live-variant-mode.md`: variants are still written to source, the browser still previews through HMR or `/source`, and the agent still uses long-poll. It makes the state machine durable and inspectable instead of relying on chat memory plus in-process queues.

---

## Problem Frame

The observed failure mode was: the user accepted and tuned a live variant in the browser while the agent was not listening continuously. The browser moved forward, source entered a transitional state, and the agent no longer had authoritative context for what the user had done. Current code already has useful browser `localStorage` resume behavior and server in-memory queues, but neither is sufficient as a recovery source after interruption, server restart, browser cleanup, or missed polling.

The product expectation is stronger: if the user changes live-mode state in the browser, a later agent should be able to ask, “what happened and what do I need to do next?” and receive a complete, durable answer.

---

## Requirements Trace

- R1. Browser live-session changes are durably recorded outside browser memory and chat context.
- R2. A new or interrupted agent can reconstruct the current live session, selected variant, parameter values, source file, and next required action.
- R3. Accept/discard events are not silently lost when the agent is not polling or when the helper server restarts.
- R4. Browser resume behavior treats source markers and durable session state as recoverable truth rather than hiding unfinished work behind a local handled flag.
- R5. Durable replay is idempotent: duplicate, late, or conflicting events must not corrupt source.
- R6. Carbonize-required accepts are represented as an explicit incomplete state until cleanup is done, and can reconcile to complete when source markers prove cleanup already happened.
- R7. Existing browser/agent event transport remains self-contained, zero-dependency, token-protected, and compatible with the current SSE plus long-poll protocol. Status/resume may be implemented as a separate CLI/file-reader API.
- R8. Agent delivery uses an explicit lease/ack model so a poll response is not treated as completed work until the agent reports a result.
- R9. Annotated screenshots and generated-file fallback metadata remain recoverable while a session is incomplete.

---

## Scope Boundaries

- This plan does not replace long-poll with WebSockets or a harness-specific integration.
- This plan does not make accept fully atomic in one step. It records and resumes the incomplete state first; a later plan may remove carbonize as a manual cleanup stage.
- This plan does not redesign generated variant quality or CSS specificity behavior, except where preview state metadata needs to be captured.
- This plan does not introduce a database or external service. Durable state should live in project-local files managed by the helper server.
- This plan does not support multiple simultaneous browser tabs editing different sessions as a first-class collaboration mode. It should avoid corrupting them, but single active session remains the baseline.

### Deferred to Follow-Up Work

- Atomic accept/carbonize: a later plan can make `live-accept.mjs` produce permanent clean source in one deterministic step.
- Safer generated-preview layout defaults: a later plan can harden wrapper sizing, overflow, and scoped selector guidance.
- Multi-user or multi-tab collaboration semantics: out of scope until live mode intentionally supports collaborative sessions.

---

## Context & Research

### Relevant Code and Patterns

- `skill/scripts/live-server.mjs` owns `/events`, `/poll`, `/source`, `/health`, token validation, in-memory `pendingEvents`, and browser SSE clients.
- `skill/scripts/live-browser.js` owns picker state, variant cycling, parameter controls, `localStorage` session resume, handled-session sentinels, and accept/discard browser behavior.
- `skill/scripts/live-poll.mjs` is the agent-facing poll client and auto-runs `live-accept.mjs` for accept/discard events.
- `skill/scripts/live-accept.mjs` deterministically accepts/discards variant wrappers and can emit carbonize-required results.
- `skill/scripts/live-wrap.mjs` creates source markers and original/variant wrapper structure.
- `tests/live-server.test.mjs`, `tests/live-accept.test.mjs`, `tests/live-wrap.test.mjs`, and `tests/live-e2e.test.mjs` are the relevant verification surfaces.
- `docs/adr-live-variant-mode.md` documents the current architecture and should be updated if the durable journal changes the lifecycle contract.

### Institutional Learnings

- `docs/adr-live-variant-mode.md` explicitly values source modification over DOM patching, zero-dependency scripts, SSE plus fetch, long-poll for agent compatibility, and `display: contents` wrappers.
- `skill/reference/live.md` currently encodes the operational assumption that the agent continuously polls and performs carbonize cleanup before the next poll.

### External References

- External research skipped. This is a repo-owned local protocol and the existing architecture is well documented; local patterns are more authoritative than generic event-sourcing guidance.

---

## Key Technical Decisions

- Add a project-local durable live-session store rather than relying on browser `localStorage` or server memory alone: this is the minimum change that lets any future agent reconstruct state.
- Use append-only session events plus a compact session snapshot: append-only events preserve the audit trail for recovery; the snapshot makes status reads fast and simple. The append-only journal and source markers are canonical; snapshots are rebuildable caches.
- Treat poll delivery as a leased work item, not queue removal: an event becomes terminal only after the agent posts a result/ack, and stale leases are redelivered idempotently.
- Keep browser `localStorage` as a UI convenience, not the source of truth: source markers plus durable server journal should win when they disagree.
- Add explicit server-side session states instead of ad hoc flags: states make accept/discard replay, stale checkpoint rejection, and invalid transitions enforceable.
- Make accept/discard delivery acknowledged before the browser clears recoverable state: this prevents “browser thinks handled, source did not change” split-brain.
- Add agent-facing status/resume scripts instead of requiring agents to inspect raw files: live mode needs a stable agent API, not folklore.

---

## Open Questions

### Resolved During Planning

- Should the first fix be durable resumability or atomic accept? Durable resumability comes first because it solves missed polling, interrupted agents, and transitional source states without requiring a full accept rewrite.
- Should durability depend on external storage? No. Live mode is intentionally self-contained and should use project-local files.

### Deferred to Implementation

- Exact file naming inside the session store: implementation should choose a simple structure after reviewing how `.impeccable-live.json` is currently managed.
- Exact retention policy defaults: implementation should start conservative and prune old completed sessions only after tests cover recovery.
- Exact browser UI copy for pending accept/reconnect states: copy should be short and can be refined during implementation.
- Exact helper-server restart UX: auto-rebind across token/port rotation is out of scope for this plan unless implementation discovers a safe lightweight path. The default recovery is `live-status` plus asking the user to reload/reopen the instrumented page when needed.

---

## Output Structure

    skill/scripts/
      live-session-store.mjs
      live-status.mjs
      live-resume.mjs
      live-complete.mjs
      live-server.mjs
      live-browser.js
      live-poll.mjs
      live-accept.mjs
    tests/
      live-session-store.test.mjs
      live-server.test.mjs
      live-browser-recovery.test.mjs
      live-poll.test.mjs
      live-status.test.mjs
      live-e2e.test.mjs
    package.json
    docs/
      adr-live-variant-mode.md

---

## High-Level Technical Design

> *This illustrates the intended approach and is directional guidance for review, not implementation specification. The implementing agent should treat it as context, not code to reproduce.*

```mermaid
stateDiagram-v2
    [*] --> configuring
    configuring --> generating: browser Go
    generating --> source_wrapped: agent wrapped source, file known
    source_wrapped --> variants_written: agent wrote variants
    variants_written --> cycling: agent replies done or browser sees wrapper
    cycling --> accept_pending: browser Accept
    cycling --> discard_pending: browser Discard
    accept_pending --> accepted_source_pending: agent/live-accept wrote accepted variant with carbonize markers
    accepted_source_pending --> complete: carbonize cleanup done
    discard_pending --> complete: source restored
    generating --> stranded: server/page/agent interruption
    cycling --> stranded: server/page/agent interruption
    accept_pending --> stranded: server/page/agent interruption
    stranded --> generating: resume pending leased event
    stranded --> source_wrapped: resume wrapper/original-only source
    stranded --> cycling: resume from durable journal + source markers
    stranded --> accepted_source_pending: resume from carbonize markers
```
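
As a rough illustration of how the diagram could be enforced server-side, the sketch below encodes the same edges in a lookup table and validates a requested transition. The `TRANSITIONS` table and the `transition()` helper are hypothetical names, not existing code in `live-server.mjs`; treating a repeated phase as an idempotent duplicate is an assumption drawn from R5.

```javascript
// Hypothetical transition table mirroring the state diagram above.
const TRANSITIONS = new Map([
  ["configuring", ["generating"]],
  ["generating", ["source_wrapped", "stranded"]],
  ["source_wrapped", ["variants_written"]],
  ["variants_written", ["cycling"]],
  ["cycling", ["accept_pending", "discard_pending", "stranded"]],
  ["accept_pending", ["accepted_source_pending", "stranded"]],
  ["accepted_source_pending", ["complete"]],
  ["discard_pending", ["complete"]],
  ["stranded", ["generating", "source_wrapped", "cycling", "accepted_source_pending"]],
  ["complete", []],
]);

// Returns the resulting phase, or an error that leaves the phase unchanged.
function transition(current, next) {
  const allowed = TRANSITIONS.get(current);
  if (!allowed) {
    return { ok: false, phase: current, error: `unknown phase: ${current}` };
  }
  // Assumption: re-sending the current phase is a harmless duplicate (R5).
  if (current === next) {
    return { ok: true, phase: current, duplicate: true };
  }
  if (!allowed.includes(next)) {
    return { ok: false, phase: current, error: `invalid transition: ${current} -> ${next}` };
  }
  return { ok: true, phase: next, duplicate: false };
}
```

Under this sketch a late Discard that arrives while the session is `accept_pending` is rejected without changing the phase, which is one way to satisfy the "rejected or ignored" wording used for that case in U2.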

```mermaid
sequenceDiagram
    participant B as Browser
    participant S as Live server
    participant J as Session journal
    participant A as Agent
    participant F as Source files

    B->>S: POST /events accept_intent(session, variant, params)
    S->>J: append event, then rebuild/atomically update snapshot(accept_pending)
    S-->>B: 202 accepted + durable sequence
    A->>S: GET /poll or live-status
    S->>J: lease next pending event
    S-->>A: accept event + recovery context + lease id
    A->>F: live-accept / cleanup
    A->>S: POST /poll done + file/result + lease id
    S->>J: append agent_result, mark lease acked, update snapshot
    S-->>B: SSE done/committed
    B->>B: clear local pending state only after committed
```

---

## Implementation Units

- U1. **Define durable session store**

**Goal:** Add a small storage module that can append events, update snapshots, read active sessions, and recover pending work after process restart.

**Requirements:** R1, R2, R3, R5, R7, R8, R9

**Dependencies:** None

**Files:**
- Create: `skill/scripts/live-session-store.mjs`
- Create: `tests/live-session-store.test.mjs`
- Modify: `skill/scripts/live-server.mjs`
- Modify: `package.json`

**Approach:**
- Store state under a project-local live directory, separate from `.impeccable-live.json` but discoverable from the same project root.
- Write append-only JSONL events per session and a compact JSON snapshot per session. The JSONL journal is authoritative; snapshots are derived and must be rebuildable when missing, stale, or contradictory.
- Include session id, event sequence, event type, phase, source file when known, visible variant, param values, timestamps, delivery lease/ack metadata, checkpoint revision, generated-file fallback mode, and annotation artifact paths.
- Persist annotation screenshot assets for incomplete sessions when generate events include annotations; status should report missing artifacts as diagnostics.
- Keep the module dependency-free, using synchronous or simple async Node filesystem code consistent with the current scripts.
- Route browser checkpoints through `/events` as a first-class `checkpoint` event type, with monotonic client revision and session owner metadata.
- Wire new non-E2E tests into the default test script, because `package.json` enumerates test files explicitly rather than globbing all `tests/*.test.mjs`.
- Treat malformed journal entries as recoverable diagnostics rather than crashing the helper server when possible.

**Execution note:** Start test-first for the storage module because it is the new source of truth.

**Technical design:** Directional event shape:

```text
SessionSnapshot = {
  id,
  phase,
  pageUrl,
  sourceFile,
  expectedVariants,
  arrivedVariants,
  visibleVariant,
  paramValues,
  pendingEventSeq,
  deliveryLease,
  checkpointRevision,
  activeOwner,
  sourceMarkers,
  fallbackMode,
  annotationArtifacts,
  diagnostics,
  updatedAt
}
```
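
To make "the journal is authoritative and snapshots are rebuildable" concrete, here is a minimal sketch of replaying JSONL journal text into a reduced snapshot. The function name `rebuildSnapshot` and the event fields `seq`, `sessionId`, and `timestamp` are assumptions for illustration; only a subset of the directional `SessionSnapshot` fields is modelled, and file I/O is left out so the replay logic stays easy to test.

```javascript
// Replay JSONL journal text into a snapshot. Field names are illustrative.
function rebuildSnapshot(jsonlText) {
  const snapshot = {
    id: null,
    phase: "configuring",
    visibleVariant: null,
    paramValues: {},
    lastSeq: 0,
    diagnostics: [],
    updatedAt: null,
  };

  jsonlText.split("\n").forEach((line, index) => {
    if (line.trim() === "") return;

    let event;
    try {
      event = JSON.parse(line);
    } catch {
      // A corrupt line becomes a diagnostic; earlier valid events still count.
      snapshot.diagnostics.push({ line: index + 1, problem: "malformed_json" });
      return;
    }
    if (event === null || typeof event !== "object") {
      snapshot.diagnostics.push({ line: index + 1, problem: "not_an_object" });
      return;
    }
    // Idempotency: a repeated or older sequence number never changes state.
    if (!Number.isInteger(event.seq) || event.seq <= snapshot.lastSeq) {
      snapshot.diagnostics.push({ line: index + 1, problem: "duplicate_or_stale_seq" });
      return;
    }

    snapshot.lastSeq = event.seq;
    snapshot.id = event.sessionId ?? snapshot.id;
    if (event.phase) snapshot.phase = event.phase;
    if (event.visibleVariant !== undefined) snapshot.visibleVariant = event.visibleVariant;
    if (event.paramValues) snapshot.paramValues = { ...event.paramValues };
    snapshot.updatedAt = event.timestamp ?? snapshot.updatedAt;
  });

  return snapshot;
}
```

Because the replay is a pure function of the journal text, a status read could compare its result with the cached snapshot file and rewrite the cache when the two disagree.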

**Patterns to follow:**
- `skill/scripts/live-server.mjs` for project-root PID file handling.
- `tests/live-server.test.mjs` for temp-directory test isolation.

**Test scenarios:**
- Happy path: appending `generate`, `variants_ready`, and `accept_intent` events for one session yields a snapshot with the latest phase and selected variant.
- Happy path: reading active sessions after constructing a new store instance returns the session written by the previous store instance.
- Edge case: duplicate event id or sequence is idempotent and does not append conflicting state twice.
- Edge case: a completed session remains readable for audit but is not returned as the active session by default.
- Edge case: journal and snapshot disagree after a simulated crash; status rebuilds the snapshot from the journal and records a repair diagnostic.
- Edge case: annotated generate event keeps its screenshot artifact available until the session completes.
- Error path: corrupted JSONL line is reported in diagnostics while valid prior events still reconstruct the snapshot.
- Error path: storage directory creation failure returns a structured error to the caller.

**Verification:**
- Durable state survives module re-instantiation and can answer “what is the active session and next action?” without browser or server memory.

---

- U2. **Journal browser events and enforce session transitions**

**Goal:** Make `/events` persist generate/accept/discard and checkpoint events before they are exposed to agent polling.

**Requirements:** R1, R3, R5, R7, R8

**Dependencies:** U1

**Files:**
- Modify: `skill/scripts/live-server.mjs`
- Modify: `tests/live-server.test.mjs`

**Approach:**
- On `POST /events`, validate generate/accept/discard/checkpoint events, assign or validate a durable event sequence, append to the session journal, then enqueue actionable events for polling. Checkpoints update durable state but do not necessarily create agent work.
- Add a server-side session state machine that accepts valid transitions and treats duplicate valid events idempotently.
- Add an at-least-once poll delivery model: pending events are leased to a poll response, redelivered after lease expiry, and marked complete only when the agent posts a matching result.
- Keep `exit` as lower priority than real session events so queued generate/accept/discard work is not masked by tab disconnect.
- Preserve current token validation and in-memory fast path, but make disk the replayable source.
- On server startup, rebuild pending work from the journal into the in-memory queue. `/poll` may consult the store when memory is empty, but the journal remains canonical.
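
The at-least-once delivery model described above might look roughly like the following in-memory sketch. `LeaseQueue`, its method names, the `lease-N` id format, and the 30-second default are all hypothetical; a real implementation would also journal each lease and ack. Rejecting an ack whose lease id no longer matches is one possible policy, chosen here so that only the most recent delivery can complete an event.

```javascript
// In-memory sketch of leased delivery. Time is passed in so tests stay deterministic.
class LeaseQueue {
  constructor(leaseMs = 30000) {
    this.leaseMs = leaseMs;
    this.items = new Map(); // seq -> { event, leaseId, leasedUntil, acked }
    this.nextLease = 1;
  }

  // Enqueueing the same sequence twice is a no-op.
  enqueue(event) {
    if (!this.items.has(event.seq)) {
      this.items.set(event.seq, { event, leaseId: null, leasedUntil: 0, acked: false });
    }
  }

  // Hand out the oldest event that is neither acked nor under a live lease.
  lease(now) {
    for (const item of this.items.values()) {
      if (item.acked) continue;
      if (item.leaseId !== null && item.leasedUntil > now) continue;
      item.leaseId = `lease-${this.nextLease++}`;
      item.leasedUntil = now + this.leaseMs;
      return { event: item.event, leaseId: item.leaseId };
    }
    return null;
  }

  // Delivery only becomes terminal when the agent acks with the current lease.
  ack(seq, leaseId) {
    const item = this.items.get(seq);
    if (!item) return { ok: false, reason: "unknown_event" };
    if (item.acked) return { ok: true, duplicate: true };
    if (item.leaseId !== leaseId) return { ok: false, reason: "stale_lease" };
    item.acked = true;
    return { ok: true, duplicate: false };
  }
}
```

For example, an event leased at time 0 is invisible to a second poll at time 10 and is handed out again under a new lease id once `leaseMs` has elapsed without an ack.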

**Patterns to follow:**
- Existing `validateEvent()` and `enqueueEvent()` in `skill/scripts/live-server.mjs`.
- Existing `/events` and `/poll` tests in `tests/live-server.test.mjs`.

**Test scenarios:**
- Happy path: `POST /events` for generate persists an event before a poll consumes it.
- Happy path: if the server object is recreated with the same project store, `/poll` returns the previously unconsumed generate event.
- Edge case: duplicate accept for the same session and variant returns the same durable state without creating conflicting queue entries.
- Edge case: discard after accept-pending is rejected or ignored according to the state machine and returns a diagnostic response.
- Error path: invalid transition returns a clear JSON error and does not append to the journal.
- Integration: queued real events are delivered before synthetic `exit` events.
- Integration: event delivered to a poll but never acked is redelivered after lease timeout and remains idempotent when eventually completed.

**Verification:**
- Browser events are not lost by an idle agent or helper server restart as long as the project-local journal remains.

---

- U3. **Add browser checkpoints and acknowledged accept/discard**

**Goal:** Have the browser record current live state durably and stop clearing recoverable local state until the server acknowledges the state change.

**Requirements:** R1, R2, R3, R4, R5

**Dependencies:** U1, U2

**Files:**
- Modify: `skill/scripts/live-browser.js`
- Modify: `tests/live-e2e.test.mjs`
- Create: `tests/live-browser-recovery.test.mjs`

**Approach:**
- Add a browser checkpoint helper that sends current session state to `/events` as `type: "checkpoint"` whenever variant, parameter values, phase, or selected source file context changes.
- Include monotonic client revisions and an active-session owner/epoch so stale checkpoints from old tabs cannot regress a newer accepted/discarded phase.
- Change accept/discard from fire-and-forget to acknowledged durable receipt before `markSessionHandled()` and before local session cleanup. Agent completion remains a later state.
- Keep a local pending state if acknowledgement fails, with a visible “waiting for agent/server” or “reconnect to recover” state instead of silently resetting.
- Preserve existing `localStorage` resume as a UI fast path, but do not let its single handled key suppress recovery when source markers or durable server state indicate pending work.
- Capture current parameter values on every change, not only at accept time, so resume can reconstruct user tuning even before Accept.
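
The stale-checkpoint rule above can be sketched as a pure server-side function. `applyCheckpoint`, `ownerEpoch`, `revision`, and the `LOCKED_PHASES` set are hypothetical names; the choice to freeze checkpoints once a session has reached an accept/discard phase is an assumption based on the "cannot regress a newer accepted/discarded phase" wording.

```javascript
// Phases after which browser checkpoints must no longer change the snapshot.
const LOCKED_PHASES = new Set([
  "accept_pending",
  "discard_pending",
  "accepted_source_pending",
  "complete",
]);

// Apply a checkpoint only if it is newer than what the snapshot already holds.
function applyCheckpoint(snapshot, checkpoint) {
  const olderOwner = checkpoint.ownerEpoch < snapshot.ownerEpoch;
  const sameOwnerOldRevision =
    checkpoint.ownerEpoch === snapshot.ownerEpoch &&
    checkpoint.revision <= snapshot.checkpointRevision;

  if (olderOwner || sameOwnerOldRevision) {
    return { applied: false, reason: "stale_checkpoint", snapshot };
  }
  if (LOCKED_PHASES.has(snapshot.phase)) {
    return { applied: false, reason: "phase_locked", snapshot };
  }
  return {
    applied: true,
    snapshot: {
      ...snapshot,
      ownerEpoch: checkpoint.ownerEpoch,
      checkpointRevision: checkpoint.revision,
      visibleVariant: checkpoint.visibleVariant,
      paramValues: { ...checkpoint.paramValues },
    },
  };
}
```

A rejected checkpoint returns the original snapshot untouched, so the caller can still journal the attempt as a diagnostic without changing the phase.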

**Patterns to follow:**
- Existing `saveSession()`, `resumeSession()`, `paramsCurrentValues`, and `handleAccept()` in `skill/scripts/live-browser.js`.
- Existing scroll restoration and MutationObserver recovery patterns in `skill/scripts/live-browser.js`.

**Test scenarios:**
- Happy path: moving a parameter slider sends a checkpoint with updated param values and does not reset the tune panel.
- Happy path: clicking Accept receives server acknowledgement, then transitions the browser to confirmed/pending-agent state.
- Edge case: Accept POST fails due to server down; browser keeps session recoverable and shows a reconnect/retry state.
- Edge case: page reload after accept acknowledgement but before agent completion resumes as pending cleanup rather than picking mode.
- Edge case: local handled flag exists but source wrapper still exists; browser resumes or reports pending cleanup instead of hiding the session.
- Edge case: stale checkpoint from a background tab arrives after accept-pending; server rejects it or records it as stale without changing snapshot phase.
- Integration: after HMR inserts variants, browser checkpoint records arrived variant count and visible variant.

**Verification:**
- The browser never becomes the only place where selected variant and tune values exist.

---

- U4. **Expose agent status and resume commands**

**Goal:** Give agents a stable CLI/API to inspect and continue live state without manually reading raw journal files or source markers.

**Requirements:** R2, R4, R6, R7

**Dependencies:** U1, U2

**Files:**
- Create: `skill/scripts/live-status.mjs`
- Create: `skill/scripts/live-resume.mjs`
- Create: `skill/scripts/live-complete.mjs`
- Modify: `skill/scripts/live-server.mjs`
- Modify: `skill/reference/live.md`
- Modify: `tests/live-server.test.mjs`
- Create: `tests/live-status.test.mjs`

**Approach:**
- Add a status endpoint or direct store reader that returns active session snapshots, pending events, source marker status, and recommended next action.
- Add `live-status.mjs` for human/agent-readable JSON status.
- Add `live-resume.mjs` if the status command should also requeue pending events or print a normalized next event for the agent.
- Include source-marker scanning for `impeccable-variants-start`, `impeccable-carbonize-start`, and `impeccable-param-values` so source truth can repair journal/browser drift. The scan domain should include resolved live config targets, journal-known files, and existing source roots used by marker helpers; generated/served files covered by live config must be diagnosable.
- Consult the generated-file guard before recommending deterministic `live-accept`; generated/served-file fallback sessions should return `persist_fallback_to_true_source`.
- Distinguish generate recovery phases: pending event only, wrapper/original-only source, variants partially/fully written, and browser not yet confirmed cycling.
- Keep output JSON stable and agent-readable: explicit `nextAction`, `reason`, `file`, `sessionId`, `paramValues`, `lease`, and `diagnostics` fields.
- After completion acknowledgement, completed sessions return `no_active_session` by default unless an include-completed flag is requested.

**Technical design:** Directional `nextAction` values:

```text
poll_for_pending_event
write_variants
continue_writing_variants
run_accept_cleanup
run_carbonize_cleanup
acknowledge_completion
persist_fallback_to_true_source
restore_discard
ask_user_to_reopen_browser
no_active_session
```
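
One way the status command could pick among these values is sketched below. The marker names come from this plan, but everything else is an assumption: the helper names `scanMarkers` and `recommendNextAction`, the substring-based marker detection, the `generatedFile` option standing in for the generated-file guard, and the exact phase-to-action mapping (in particular the choices for `cycling` and for marker-only recovery) are illustrative and not a specification.

```javascript
// Detect marker names by substring; the real marker syntax is not assumed here.
function scanMarkers(sourceText) {
  return {
    variants: sourceText.includes("impeccable-variants-start"),
    carbonize: sourceText.includes("impeccable-carbonize-start"),
    paramValues: sourceText.includes("impeccable-param-values"),
  };
}

// Combine the durable snapshot (may be null) with source markers.
function recommendNextAction(snapshot, markers, options = {}) {
  const hasWork = markers.variants || markers.carbonize;
  // Generated/served files must be persisted to true source, not auto-accepted.
  if (options.generatedFile && hasWork) return "persist_fallback_to_true_source";
  // Carbonize markers in source win even without a journal or browser.
  if (markers.carbonize) return "run_carbonize_cleanup";
  if (!snapshot) {
    // Assumption: marker-only recovery needs the browser to rejoin the session.
    return markers.variants ? "ask_user_to_reopen_browser" : "no_active_session";
  }

  switch (snapshot.phase) {
    case "generating":
      return "poll_for_pending_event";
    case "source_wrapped":
      return "write_variants";
    case "variants_written":
      return snapshot.arrivedVariants < snapshot.expectedVariants
        ? "continue_writing_variants"
        : "poll_for_pending_event";
    case "cycling":
      return "poll_for_pending_event";
    case "accept_pending":
      return "run_accept_cleanup";
    case "accepted_source_pending":
      // Markers are gone, so cleanup happened but the ack was lost.
      return "acknowledge_completion";
    case "discard_pending":
      return "restore_discard";
    case "complete":
      return "no_active_session";
    default:
      return "ask_user_to_reopen_browser";
  }
}
```

Checking the generated-file guard and the carbonize markers before the journal phase reflects the plan's rule that source truth can repair journal or browser drift.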

**Patterns to follow:**
- `skill/scripts/live-poll.mjs` for CLI JSON output style.
- `skill/scripts/live-accept.mjs` marker parsing helpers where reusable.

**Test scenarios:**
- Happy path: status with a pending accept event returns `run_accept_cleanup` and includes variant id and param values.
- Happy path: status with carbonize markers in source returns `run_carbonize_cleanup` even if no browser is connected.
- Edge case: status with no active journal but source variant markers returns a recoverable source-marker session.
- Edge case: status with stale completed sessions returns `no_active_session` by default and can include completed sessions only when requested.
- Edge case: source markers exist in a generated/served file covered by live config; status recommends fallback true-source persistence rather than deterministic accept.
- Edge case: wrapper exists with only original content; status recommends continuing variant writing rather than accept cleanup.
- Error path: missing or unreadable source file appears as a structured diagnostic, not an unhandled exception.
- Integration: `live-resume.mjs` can requeue or emit the pending event after a helper server restart.

**Verification:**
- A fresh agent can run one command and know exactly what happened in live mode and what to do next.

---

- U5. **Make poll/accept acknowledge durable completion**

**Goal:** Ensure agent-side handling updates durable session state after source writes and carbonize cleanup so browser and future agents know whether the session is complete.

**Requirements:** R2, R5, R6

**Dependencies:** U1, U2, U4

**Files:**
- Modify: `skill/scripts/live-poll.mjs`
- Modify: `skill/scripts/live-accept.mjs`
- Modify: `skill/scripts/live-server.mjs`
- Modify: `tests/live-accept.test.mjs`
- Modify: `tests/live-server.test.mjs`
- Create: `tests/live-poll.test.mjs`

**Approach:**
- Have `live-poll.mjs` report auto-accept/discard results back to the server/session store, not only print `_acceptResult` to stdout.
- Complete the delivery lease only after this result is durably recorded; failed source writes should release or mark the lease recoverable with diagnostics.
- Represent `carbonize: true` as `accepted_source_pending` until cleanup is confirmed.
- Add `live-complete.mjs` as the canonical completion acknowledgement path after carbonize cleanup. `live-status.mjs` reports that completion is needed; `live-resume.mjs` helps recover pending work; only `live-complete.mjs` marks carbonize cleanup complete.
- Reconcile lost completion acknowledgements from source truth: if carbonize markers are gone and accepted content is materialized, status can idempotently record a synthetic completion event.
- Make duplicate completion acknowledgements idempotent.
- Preserve the stderr warning as a human attention signal, but do not rely on warning text as the state machine.
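
Idempotent completion, whether it arrives from the agent or is reconciled from source, could be sketched as an append that happens at most once per session. The function name `completeSession`, the `completed` event type, and the `origin` field are assumptions for illustration; the journal is modelled as an in-memory array so the example stays self-contained.

```javascript
// Append a completion event at most once per session.
// `origin` distinguishes an explicit agent ack from a synthetic reconciliation.
function completeSession(events, sessionId, origin = "agent") {
  const existing = events.find(
    (e) => e.sessionId === sessionId && e.type === "completed"
  );
  if (existing) {
    // Duplicate acknowledgement: report the original event, append nothing.
    return { appended: false, event: existing };
  }
  const seq = events.reduce((max, e) => Math.max(max, e.seq), 0) + 1;
  const event = { seq, sessionId, type: "completed", phase: "complete", origin };
  events.push(event);
  return { appended: true, event };
}
```

With this shape, a status read that notices the carbonize markers are gone could call `completeSession(events, id, "reconciled_from_source")`, and a later explicit acknowledgement for the same session would be a no-op.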

**Patterns to follow:**
- Current `_acceptResult` attachment in `skill/scripts/live-poll.mjs`.
- Current carbonize marker output in `skill/scripts/live-accept.mjs`.

**Test scenarios:**
- Happy path: accept event processed by `live-poll.mjs` updates durable session to `accepted_source_pending` when carbonize is required.
- Happy path: discard event processed by `live-poll.mjs` updates durable session to complete.
- Happy path: a `live-poll.mjs` CLI/integration test processes an accept event and records the durable lease/result acknowledgement through the real poll script path.
- Edge case: running accept twice for the same event returns the same final durable phase without rewriting source twice.
- Error path: `live-accept.mjs` failure records an agent error in the session journal and leaves next action recoverable.
- Integration: browser reload after agent accept but before carbonize can be diagnosed by `live-status.mjs`.
- Integration: completion acknowledgement is lost after source cleanup; status detects the condition and recommends `acknowledge_completion`, then `live-complete.mjs` resolves the session complete.

**Verification:**
- Durable state reflects what happened to source, not only what the browser requested.

---

- U6. **Update live-mode guidance and recovery UX**

**Goal:** Teach both agents and users the new recovery contract and make recovery visible in live mode.

**Requirements:** R2, R4, R6, R7

**Dependencies:** U3, U4, U5

**Files:**
- Modify: `skill/reference/live.md`
- Modify: `docs/adr-live-variant-mode.md`
- Modify: `skill/scripts/live-browser.js`
- Modify: `README.md` if live command usage is documented there

**Approach:**
- Update `skill/reference/live.md` to start with a status check when resuming a session or when the agent suspects it missed events.
- Document the durable journal and session states in the ADR.
- Add concise browser states for pending accept, reconnect/recover, and agent cleanup pending.
- Keep chat overhead low: the agent should use status JSON instead of asking the user to describe browser state.

**Patterns to follow:**
- Existing `skill/reference/live.md` contract sections for poll loop, accept, carbonize, and cleanup.
- Existing toast and bar state patterns in `skill/scripts/live-browser.js`.

**Test scenarios:**
- Test expectation: mostly documentation and UX copy. Behavioral coverage belongs to U3-U5; this unit should be verified through review plus any snapshot/E2E assertions added for visible pending states.

**Verification:**
- A new agent reading `skill/reference/live.md` knows to recover state before making assumptions after interruption.

---

- U7. **Add restart and interruption E2E coverage**

**Goal:** Prove the durable recovery contract in realistic browser/server/agent flows.

**Requirements:** R1, R2, R3, R4, R5, R6

**Dependencies:** U1, U2, U3, U4, U5

**Files:**
- Modify: `tests/live-e2e.test.mjs`
- Modify: `tests/framework-fixtures/README.md` if new fixture expectations are needed
- Modify: `tests/live-e2e/agent.mjs` if deterministic agent hooks need to simulate interruption

**Approach:**
- Add focused E2E cases for missed polling, helper server restart, browser reload, accept before agent resumes, and carbonize-pending status.
- Prefer one or two compact fixtures over broad matrix explosion.
- Keep the deterministic fake agent path as the primary verification route; the LLM agent remains opt-in.

**Patterns to follow:**
- Existing live-mode E2E setup in `tests/live-e2e.test.mjs`.
- Fixture authoring guidance in `tests/framework-fixtures/README.md`.

**Test scenarios:**
- Integration: user clicks Go while no poll is active; later poll receives durable generate event and completes variants.
- Integration: user changes Tune values, reloads browser, and status/resume reports the same values.
- Integration: user clicks Accept, agent is interrupted, new agent runs status and sees `run_accept_cleanup` with variant id and params.
- Integration: helper server restarts after queued generate; resumed server can still surface the pending event.
- Integration: agent receives a poll event through the real `live-poll.mjs` path and then crashes before posting result; a resumed poll redelivers the leased event after timeout.
- Integration: source contains carbonize markers; status returns cleanup pending even without an active browser tab.
- Integration: out-of-order checkpoint after accept does not regress pending accept state.
- Error path: duplicate accept and late discard do not corrupt final source.

**Verification:**
- E2E coverage demonstrates recovery from the exact “agent was not listening” failure mode.

---

## System-Wide Impact

- **Interaction graph:** Browser `live-browser.js` posts events/checkpoints to `live-server.mjs`; server persists them through `live-session-store.mjs`; agent observes them through `live-poll.mjs`, `live-status.mjs`, or `live-resume.mjs`; source mutations still happen through `live-wrap.mjs` and `live-accept.mjs`.
- **Error propagation:** Failed event persistence must be visible to the browser before it clears state. Failed source cleanup must be visible in status output and kept recoverable.
- **State lifecycle risks:** The key risk is split-brain among browser localStorage, server journal, in-memory queue, snapshots, and source markers. The plan makes source markers plus durable journal canonical, snapshots rebuildable, in-memory queues cache-only, and browser localStorage only an optimization.
- **API surface parity:** The new status/resume scripts become agent-facing APIs and should be reflected in `skill/reference/live.md` and tests.
- **Integration coverage:** Unit tests alone will not prove recovery. E2E must cover browser reload, server restart, missed poll, duplicate events, and carbonize pending states.
- **Unchanged invariants:** Live mode remains source-first, self-contained, zero-dependency, token-protected, and transport-compatible with current AI harnesses.

---

## Risks & Dependencies

| Risk | Mitigation |
|------|------------|
| Durable journal becomes another state source that can drift | Make source markers plus journal reconciliation explicit in `live-status.mjs`; keep browser localStorage advisory. |
| Event replay causes duplicate source rewrites | Add idempotency keys, state-machine validation, leased delivery, and tests for duplicate accept/discard. |
| Browser waits forever after accept if agent is absent | Show explicit pending/reconnect state and provide status/resume command for the agent. |
| Poll delivery is mistaken for completion | Model leased delivery separately from completion acknowledgement and redeliver expired leases. |
| Snapshot contradicts journal after crash | Treat journal as canonical and rebuild/repair snapshots on startup/status reads. |
| Annotated screenshot path goes stale before recovery | Store annotation artifacts with the incomplete session and retain them until completion cleanup. |
| Disk writes fail in restricted environments | Return structured server errors and do not clear browser state until persistence succeeds. |
| Tests become slow or brittle | Keep most coverage in store/server tests and add only targeted E2E cases for true cross-process recovery. |
| Backward compatibility with existing installed skill output drifts | Update source skill files first, then run the project build to regenerate provider outputs in a separate execution phase. |

---

## Documentation / Operational Notes

- Update `docs/adr-live-variant-mode.md` because durability changes the architecture from memory queue plus localStorage to journaled sessions.
- Update `skill/reference/live.md` so agents know to run status/resume after interruption or before assuming no pending work.
- If generated provider skill outputs are tracked, implementation should regenerate them with the existing build process after changing `skill/`.
- Because the default test script enumerates test files, implementation must update `package.json` when adding new non-E2E test files.
- Consider adding `.impeccable-live/` or the chosen session-store directory to gitignore if it is not already ignored.

---

## Sources & References

- Related architecture: `docs/adr-live-variant-mode.md`
- Live instructions: `skill/reference/live.md`
- Browser live implementation: `skill/scripts/live-browser.js`
- Server transport: `skill/scripts/live-server.mjs`
- Agent poll client: `skill/scripts/live-poll.mjs`
- Accept/discard source cleanup: `skill/scripts/live-accept.mjs`
- Live-mode tests: `tests/live-server.test.mjs`, `tests/live-accept.test.mjs`, `tests/live-e2e.test.mjs`

---

## Surgical Overhaul Follow-Ups (2026-04-29)

- Dependency audit is now Bun-native via `bun run audit`, but the gate currently fails on transitive advisories:
  - `archiver -> archiver-utils -> lodash` and `brace-expansion`
  - optional `puppeteer -> @puppeteer/browsers -> basic-ftp`
- Decision: keep the audit script honest and failing instead of adding ignores before exploitability and replacement cost are reviewed.
- Recommended next dependency work: evaluate whether release ZIP creation still needs `archiver`, and whether optional Puppeteer screenshot tooling can be replaced, isolated, or upgraded without bloating install size.