fix(db): prevent SQLITE_NOTADB corruption under concurrent sub-agents (#2690)
Greg Slepak
created
Crush could fail to start with "file is not a database (26)" after a
session ran many sub-agents in parallel. The shared *sql.DB was opened
with Go's default unlimited connection pool, so each concurrent sub-agent
(session/message/cost writes via coordinator.runSubAgent) could open its
own SQLite handle. Under load, WAL frames from different handles
interleaved and auto-checkpoints could race; if the process was cancelled
or killed mid-checkpoint, the main DB header and WAL desynced, producing
SQLITE_NOTADB (26) on the next open.
Two changes address this:
- Set db.SetMaxOpenConns(1). SQLite serializes writes at the file level
anyway, so extra pool connections add no throughput, only race surface.
Forcing a single pooled connection makes database/sql queue all callers
through one SQLite handle, eliminating cross-handle WAL/checkpoint
races. Sub-agents still function; their writes just serialize (which
SQLite was going to do regardless).
- Add _txlock=immediate to both driver DSNs (modernc.org/sqlite and
ncruces/go-sqlite3). BEGIN IMMEDIATE acquires the reserved-writer lock
up front so concurrent writers queue cleanly on busy_timeout instead of
racing the deferred-to-writer upgrade (the pattern behind prior
SQLITE_BUSY reports, e.g. #2129, #2576). Defense in depth alongside the
pool limit.
For the ncruces driver the DSN must use the file: URI prefix for query
params to be parsed; the plain path form silently ignores _txlock.
@@ -50,6 +50,13 @@ func Connect(ctx context.Context, dataDir string) (*sql.DB, error) {
return nil, err
}
+ // Serialize all access through a single connection. SQLite serializes
+ // writes at the file level anyway, and allowing multiple pool
+ // connections to interleave writes/checkpoints (especially under
+ // concurrent sub-agents) has caused WAL/header desync resulting in
+ // SQLITE_NOTADB (26) on the next open.
+ db.SetMaxOpenConns(1)
+
if err = db.PingContext(ctx); err != nil {
db.Close()
return nil, fmt.Errorf("failed to connect to database: %w", err)