Don't auto-retry in certain circumstances (#35037)

Richard Feldman created

Someone encountered this in production, which should not happen:

<img width="1266" height="623" alt="Screenshot 2025-07-24 at 10 38
40 AM"
src="https://github.com/user-attachments/assets/40f3f977-5110-4808-a456-7e708d953b3b"
/>

This moves certain errors into the category of "never retry" and reduces
the number of retries for some others. Also it adds some diagnostic
logging for retry policy.

It's not a complete fix for the above, because the underlying issues is
that the server is sending a HTTP 403 response and although we were
already treating 403s as "do not retry" it was deciding to retry with 2
attempts anyway. So further debugging is needed to figure out why it
wasn't going down the 403 branch by the time the request got here.

Release Notes:

- N/A

Change summary

crates/agent/src/thread.rs | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)

Detailed changes

crates/agent/src/thread.rs 🔗

@@ -2037,6 +2037,12 @@ impl Thread {
                                         if let Some(retry_strategy) =
                                             Thread::get_retry_strategy(completion_error)
                                         {
+                                            log::info!(
+                                                "Retrying with {:?} for language model completion error {:?}",
+                                                retry_strategy,
+                                                completion_error
+                                            );
+
                                             retry_scheduled = thread
                                                 .handle_retryable_error_with_delay(
                                                     &completion_error,
@@ -2246,15 +2252,14 @@ impl Thread {
                 ..
             }
             | AuthenticationError { .. }
-            | PermissionError { .. } => None,
-            // These errors might be transient, so retry them
-            SerializeRequest { .. }
-            | BuildRequestBody { .. }
-            | PromptTooLarge { .. }
+            | PermissionError { .. }
+            | NoApiKey { .. }
             | ApiEndpointNotFound { .. }
-            | NoApiKey { .. } => Some(RetryStrategy::Fixed {
+            | PromptTooLarge { .. } => None,
+            // These errors might be transient, so retry them
+            SerializeRequest { .. } | BuildRequestBody { .. } => Some(RetryStrategy::Fixed {
                 delay: BASE_RETRY_DELAY,
-                max_attempts: 2,
+                max_attempts: 1,
             }),
             // Retry all other 4xx and 5xx errors once.
             HttpResponseError { status_code, .. }