# API Key Validation

This document describes how `ProviderConfig.TestConnection` proves that a user's
API key authenticates against a given provider, and the conventions for adding
or changing a provider's validation behavior.

If you are touching `buildValidationProbe`, any `classify*` function, the
`openaiCompatModelsAllowlist`, or `APIKeyInputState*` in the API key dialog,
**you must also update this document in the same commit.**

---

## The problem this layer solves

When a user enters an API key in the onboarding dialog, Crush tells them
whether the key is valid before saving it. The obvious implementation — make
an HTTP request to the provider and see if it succeeds — turns out to be
wrong in a specific and silent way.

Historically, Crush validated keys by calling `GET /models` and treating a
`200 OK` response as proof of authentication. This works for the native
OpenAI and Anthropic APIs, where `/models` is auth-gated. It does **not**
work for most OpenAI-compatible gateways, because `/models` on those
services is deliberately public: it powers SDK catalogs, docs sites, and
model-picker UIs that render without requiring a signup. A public `/models`
endpoint returns `200` to every caller — valid key, invalid key, no key at
all — so the response proves nothing about the caller's credentials.

The consequence was that for ten providers (AiHubMix, Avian,
Cortecs, HuggingFace Router, io.net, OpenCode Go, OpenCode Zen, QiniuCloud,
Synthetic, Venice), Crush's "Key validated" message didn't actually reflect
whether the key would work. Any string the user typed — a typo, a
copy-paste from the wrong field, the literal word `test` — was reported as
a valid key. The failure surfaced later, when the user actually tried to
run the model and got an authentication error from the provider. A
separate bug made MiniMax and MiniMax China return "validated"
unconditionally, regardless of endpoint behavior.

The fix is to stop assuming `/models` proves authentication. For each
provider we either:

1. Find a different endpoint that actually gates on auth (typically
   account-scoped data like rate limits or credits).
2. Use the auth-gated `/chat/completions` endpoint with a deliberately
   malformed body, so we can tell auth failure apart from schema failure
   without running inference.
3. Admit we cannot reliably verify the key and say so in the UI
   ("saved, not verified") rather than faking success.

This is per-provider policy by necessity, because providers disagree about
which endpoints authenticate, which status codes they return for bad keys,
and whether their gateways authenticate before or after validating the
request body. Managing that per-provider-ness — without silently
regressing back into "assume `/models` proves auth" — is what the rest of
this document exists to do.

---

## Contract

`TestConnection` returns one of three things:

| Return value               | Meaning                                                 | UI state                                             |
| -------------------------- | ------------------------------------------------------- | ---------------------------------------------------- |
| `nil`                      | Authentication proven.                                  | `APIKeyInputStateVerified` ("validated")             |
| `ErrValidationUnsupported` | No deterministic probe exists. Key saved, not verified. | `APIKeyInputStateUnverified` ("saved, not verified") |
| Any other non-nil `error`  | Probe ran and the server rejected the key.              | `APIKeyInputStateError` ("invalid")                  |

"Saved, not verified" is a **first-class outcome**, not a failure. It is
strictly preferable to a false-positive "validated" — the original bug that
motivated this whole machinery was the system telling users bad keys were valid
because `/models` returned `200` to any caller.

---

## Why not just hit `/models`?

Many OpenAI-compatible gateways intentionally expose a public `/models` endpoint
so SDKs, docs sites, and model pickers can render a catalog without requiring
signup. On those providers, `GET /models` returns `200 OK` regardless of the key
— so `200` proves nothing about authentication.

The fix is to pick a probe whose response **depends on the key**. That means
either:

1. Hit an endpoint the provider actually gates on auth (typically account-scoped
   data like rate limits or credits).
2. Hit an auth-gated endpoint (`/chat/completions`) with an intentionally broken
   payload, so the server authenticates the caller before rejecting the body —
   without actually running inference.
3. If neither is available, return `ErrValidationUnsupported` and let the UI
   fall back to "saved, not verified."

---

## Classifiers

The four `classify*` functions in `config.go` cover every probe currently in
use. Keep this set small — prefer "use an existing classifier" over "add a new
one."

| Classifier                    | Valid → `nil`                             | Invalid → error     | Anything else              |
| ----------------------------- | ----------------------------------------- | ------------------- | -------------------------- |
| `classifyAuthGated`           | `200`                                     | `401`, `403`        | `ErrValidationUnsupported` |
| `classifyOpenAIChatMalformed` | `400`, `422` (auth passed, body rejected) | `401`, `403`        | `ErrValidationUnsupported` |
| `classifyGoogleModels`        | `200`                                     | `400`, `401`, `403` | `ErrValidationUnsupported` |
| `classifyZAIModels`           | anything except `401`                     | `401` only          | (no unsupported bucket)    |

Transient statuses (`5xx`, `429`, `402`, unexpected `200` on the chat probe)
collapse into `ErrValidationUnsupported` on purpose — a flaky gateway should not
surface as "your key is bad."

---

## Probes by provider

### Auth-gated endpoint (GET)

| Provider ID                | Endpoint                             | Auth                              | Classifier             |
| -------------------------- | ------------------------------------ | --------------------------------- | ---------------------- |
| `openai`                   | `GET {base}/models`                  | `Authorization: Bearer`           | `classifyAuthGated`    |
| `openrouter`               | `GET {base}/credits`                 | `Authorization: Bearer`           | `classifyAuthGated`    |
| `anthropic`                | `GET {base}/models`                  | `x-api-key` + `anthropic-version` | `classifyAuthGated`    |
| `kimi-coding`              | `GET {base}/v1/models`               | `x-api-key` + `anthropic-version` | `classifyAuthGated`    |
| `gemini`                   | `GET {base}/v1beta/models?key=<key>` | key in query                      | `classifyGoogleModels` |
| `venice`                   | `GET {base}/api_keys/rate_limits`    | `Authorization: Bearer`           | `classifyAuthGated`    |
| `minimax`, `minimax-china` | `GET {base}/v1/models`               | `x-api-key` + `anthropic-version` | `classifyAuthGated`    |

### OpenAI-compat `/models` allowlist

For `openai-compat` providers only. An entry in this allowlist means the
provider's `/models` endpoint has been **empirically confirmed** to return
`401` on a bad key:

| Provider ID                                                                         | Probe                                                                                                           |
| ----------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| `deepseek`, `groq`, `xai`, `zhipu`, `zhipu-coding`, `cerebras`, `nebius`, `copilot` | `GET {base}/models` + `Authorization: Bearer`, `classifyAuthGated`                                               |
| `zai`                                                                               | Same probe, but uses `classifyZAIModels` (only `401` is invalid; valid keys return assorted non-`200` statuses) |

### Malformed-body chat probe

Used when the provider's `/models` is public but `/chat/completions` is
auth-gated. Sends `{"__crush_probe__": true}` (missing required fields) so the
gateway authenticates the caller before rejecting the schema. No tokens
consumed.

| Provider IDs                                                                                                     |
| ---------------------------------------------------------------------------------------------------------------- |
| `aihubmix`, `avian`, `cortecs`, `huggingface`, `ionet`, `opencode-go`, `opencode-zen`, `qiniucloud`, `synthetic` |

All use `POST {base}/chat/completions` + `Authorization: Bearer` +
`classifyOpenAIChatMalformed`.

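A sketch of how such a probe request might be built. `buildChatProbe` is a hypothetical helper (the real construction lives in `buildValidationProbe`), and no request is actually sent here:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// buildChatProbe builds POST {base}/chat/completions with a body that is
// valid JSON but missing every required field, so an auth-gated gateway
// returns 401/403 for a bad key and 400/422 for a good one.
// baseURL and apiKey are caller-supplied; nothing is sent on the wire.
func buildChatProbe(baseURL, apiKey string) (*http.Request, error) {
	body := strings.NewReader(`{"__crush_probe__":true}`)
	req, err := http.NewRequest(http.MethodPost,
		strings.TrimRight(baseURL, "/")+"/chat/completions", body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := buildChatProbe("https://example.invalid/v1/", "sk-test")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(req.Header.Get("Authorization"))
}
```

Because the body never parses as a chat request, the probe cannot trigger inference regardless of what the gateway does after auth.
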
### Prefix check (no network)

| Provider ID | Rule                                                                                                                                        |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `bedrock`   | Key must start with `ABSK`. Weak signal — Bedrock's `/foundation-models` endpoint is region-specific, so we fall back to format validation. |
| `vercel`    | Key must start with `vck_`. Vercel's `/models` does not gate on auth.                                                                       |

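The two rules can be sketched offline. `prefixValid` is a hypothetical helper, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// prefixValid sketches the offline format checks from the table above.
// It is a weak signal: a well-formed prefix does not prove the key works.
func prefixValid(providerID, key string) bool {
	switch providerID {
	case "bedrock":
		return strings.HasPrefix(key, "ABSK")
	case "vercel":
		return strings.HasPrefix(key, "vck_")
	default:
		return false // no prefix rule known for this provider
	}
}

func main() {
	fmt.Println(prefixValid("vercel", "vck_abc123"))  // true
	fmt.Println(prefixValid("bedrock", "vck_abc123")) // false
}
```
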
### Explicitly unverified

| Provider ID                          | Why                                                                                                                                        |
| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `chutes`                             | Observed ambiguous response (`429`) on the unauthenticated probe path; the classifier cannot reliably distinguish bad-key from rate-limited. |
| `neuralwatt`                         | Observed classifier ambiguity on the malformed-body probe.                                                                                 |
| Any unknown `openai-compat` provider | Default fallback. We don't assume `/models` is auth-gated for providers we haven't tested.                                                 |

---

## Adding a new provider

Follow this checklist in order:

### 1. Identify the provider's `Type`

Check the `type` field in the catwalk provider definition:

- `openai`, `anthropic`, `google`, `openrouter`, `bedrock`, `vercel`, `azure`,
  `vertexai` — the type-based default in `buildValidationProbe` already covers
  you. Stop here unless the provider has quirks.
- `openai-compat` — keep going.

### 2. Test the provider's `/models` endpoint with a bad key

```sh
curl -i -H "Authorization: Bearer definitely-not-a-real-key" \
  https://<provider>/v1/models
```

| Response                    | Action                                                                                              |
| --------------------------- | --------------------------------------------------------------------------------------------------- |
| `401` or `403`              | `/models` is auth-gated. Add the provider ID to `openaiCompatModelsAllowlist` in `config.go`. Done. |
| `200`                       | `/models` is public. Go to step 3.                                                                  |
| `429`, `5xx`, anything else | Ambiguous. Mark unsupported (step 5).                                                               |

### 3. Test `/chat/completions` with a bad key and malformed body

```sh
curl -i -X POST \
  -H "Authorization: Bearer definitely-not-a-real-key" \
  -H "Content-Type: application/json" \
  -d '{"__crush_probe__":true}' \
  https://<provider>/v1/chat/completions
```

| Response       | Action                                                                                                 |
| -------------- | ------------------------------------------------------------------------------------------------------ |
| `401` or `403` | Auth gate works. Add the provider ID to the chat-probe `case` in `buildValidationProbe`. Go to step 4.  |
| `400` or `422` | Gateway validates schema before auth. `/chat/completions` is not usable as a probe. Go to step 5.       |
| Anything else  | Mark unsupported (step 5).                                                                              |

### 4. Confirm the good-key behavior

Using a real key, repeat the `/chat/completions` probe with the malformed body.
Expect `400` or `422`. If you get `200`, the gateway is not validating the body
and the probe cannot distinguish valid from invalid keys — go to step 5 instead.

### 5. Mark the provider unsupported

Add the provider ID to the
`InferenceProviderChutes, InferenceProviderNeuralwatt` case in
`buildValidationProbe` (or extend it). The UI will show "saved (not verified)"
and the user can still use the provider.

### 6. Update this document and the tests

- Add the provider to the appropriate table in the "Probes by provider" section
  above.
- Add a test case to `TestTestConnectionOpenAICompatProviderAudit` in
  `config_validate_test.go` (for probe-based providers).
- Add the provider to the appropriate list in
  `TestTestConnectionPublicModelsAuthGatedChatRegression` (for chat-probe
  providers) or `TestTestConnectionOpenAICompatAllowlistUsesModelsProbe` (for
  allowlist providers).

---

## Decision log

### Why allowlist, not default-allow, for `openai-compat` `/models`?

The regression that motivated this document was exactly "default-allow." Commit
`7d14abb9` (2025-10-20) expanded the `/models` check from `TypeOpenAI` to all
`openai-compat` providers, silently turning every gateway with a public model
catalog into a false-positive validator. A default-deny allowlist makes new
providers opt in explicitly — the cost is one line per provider; the benefit is
no more silent regressions of this shape.

### Why is ZAI special?

ZAI's `/models` endpoint is authoritative about bad keys (always `401`) but
noisy about valid keys (returns assorted non-`200` statuses, seemingly depending
on backend state). Folding it into `classifyAuthGated` would regress valid-key
detection, since everything except `200` would be classified as
`ErrValidationUnsupported`. The ZAI classifier treats `401` as invalid and
everything else as valid.

This is fragile — if ZAI ever changes their endpoint so `401` is no longer
specific to auth, the classifier becomes wrong. It is documented here so the
next person to touch it knows why it exists.

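As a sketch, assuming the decision reduces to the status code alone (`errInvalidKey` is a placeholder name, not the real error value):

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidKey = errors.New("provider rejected the API key")

// classifyZAIModels sketches the ZAI rule described above: 401 is the only
// status treated as a bad key; every other status counts as valid. There is
// deliberately no ErrValidationUnsupported bucket.
func classifyZAIModels(status int) error {
	if status == 401 {
		return errInvalidKey
	}
	return nil
}

func main() {
	fmt.Println(classifyZAIModels(401) != nil) // true: bad key
	fmt.Println(classifyZAIModels(503) == nil) // true: treated as valid
}
```

The asymmetry is the whole point: ZAI gives a crisp invalid signal and a noisy valid one, so the classifier only trusts the crisp side.
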
### Why is MiniMax in the provider-ID switch and not the `TypeAnthropic` branch?

MiniMax's base URL is `https://api.minimax.io/anthropic`, which means
`{base}/models` resolves to `/anthropic/models` — a 404. The correct endpoint is
`{base}/v1/models`. The `TypeAnthropic` default assumes the base URL already
ends in `/v1`, which holds for native Anthropic but not for MiniMax. Rather than
special-case the type branch, MiniMax gets its own explicit probe entry.

A previous bug (commit `cce8edf9`, 2026-04-23) removed MiniMax's validation
entirely by returning `nil` unconditionally. Don't reintroduce that.

### Why does Google need its own classifier?

Google returns `400 INVALID_ARGUMENT` for unknown API keys, not `401`. If
`classifyAuthGated` were used, bad Google keys would produce `400` →
`ErrValidationUnsupported` → "saved, not verified" — downgrading a real auth
failure into a soft warning. `classifyGoogleModels` adds `400` to the invalid
bucket specifically for this case.

### Why not push all this into catwalk?

Probe metadata (method, path, classifier kind) is provider-identity data and
arguably belongs alongside the rest of the catwalk provider definition. The
classifier functions and UI mapping are crush-specific behavior and should stay
here.

This refactor hasn't been done yet because the set of providers and classifiers
is still small enough that the indirection cost (catwalk schema change, version
bump, fallback for older catwalks) outweighs the benefit. If the allowlist or
classifier set grows significantly, revisit this decision.

---

## Regression history

| Date       | Commit     | Effect                                                                                                                                                                |
| ---------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 2025-10-20 | `7d14abb9` | Expanded `TestConnection` from `openai` to `openai-compat` with a generic `/models` probe, silently false-validating bad keys for ten providers with public `/models`. |
| 2026-04-23 | `cce8edf9` | Removed MiniMax's format-prefix guard, causing `TestConnection` to return `nil` unconditionally for MiniMax / MiniMax China.                                           |
| (current)  | (this PR)  | Replaced the generic `/models` probe with per-provider probes, added the `ErrValidationUnsupported` sentinel, wired up the UI "saved (not verified)" state.            |

Any future regression in this layer should be recorded here.