ep: Rework `ep repair` to use original teacher prompt (#49335)

Oleksiy Syvokon created

It now creates a multi-turn conversation, where the first two messages are
the original teacher prompt and output.

This way, changes to the teacher prompt are automatically applied to
repair without having to sync them explicitly.
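The conversation layout can be sketched as follows. This is a minimal, self-contained illustration with hypothetical `Role`/`Message` types; the real code uses the provider-specific message structs from the `anthropic` and `open_ai` crates.

```rust
// Hypothetical simplified message types for illustration only.
#[derive(Debug, PartialEq)]
enum Role {
    User,
    Assistant,
}

struct Message {
    role: Role,
    text: String,
}

/// Assemble the three-turn repair request: the original teacher prompt
/// and response are replayed verbatim, then the repair critique is
/// appended as a new user turn. The model's reply becomes Turn 4.
fn build_repair_conversation(
    teacher_prompt: &str,
    teacher_response: &str,
    repair_message: &str,
) -> Vec<Message> {
    vec![
        Message { role: Role::User, text: teacher_prompt.to_string() },
        Message { role: Role::Assistant, text: teacher_response.to_string() },
        Message { role: Role::User, text: repair_message.to_string() },
    ]
}
```

Because Turn 1 is the unmodified teacher prompt, any future edits to that prompt flow into repair for free.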


Release Notes:

- N/A

Change summary

crates/edit_prediction_cli/src/prompts/repair.md | 117 ++++-------------
crates/edit_prediction_cli/src/repair.rs         | 109 +++++++++------
2 files changed, 92 insertions(+), 134 deletions(-)

Detailed changes

crates/edit_prediction_cli/src/prompts/repair.md

@@ -1,103 +1,42 @@
-# Instructions
-
-You are an edit prediction assistant in a code editor. Your task is to generate an improved prediction based on feedback from a quality assessment.
-
-A previous model generated a prediction that was judged to have issues. Your job is to generate a better prediction that addresses the feedback.
-
-## Focus on
-
-- Completing any partially-applied changes made
-- Ensuring consistency with the programming style and patterns already established
-- Making edits that maintain or improve code quality
-- NOT reverting or undoing changes the user intentionally made
-
-## Rules
-
-- **NEVER undo or revert the user's recent edits.** Examine the diff in the edit history carefully:
-  - If a line was removed (starts with `-`), do NOT restore that content—even if the code now appears incomplete or broken without it
-  - If a line was added (starts with `+`), do NOT delete or significantly modify it
-  - If code appears broken or incomplete after the user's edit, output `NO_EDITS` rather than "fixing" it by reverting
-  - Only add NEW content that extends the user's work forward; never restore what they removed
-  - **Key test**: if your prediction would make the code more similar to what it was BEFORE the user's edit, output `NO_EDITS` instead
-  - **Never assume a deletion was accidental.** Even if removing content breaks the code, breaks a pattern, or leaves text looking "incomplete", respect it. The user may be mid-rewrite. Do NOT "complete" partial text by restoring what was deleted.
-- Do not just mechanically apply patterns - reason about what changes make sense given the context and the programmer's apparent goals.
-- Do not just fix syntax errors - look for the broader refactoring pattern and apply it systematically throughout the code.
-- Keep existing formatting unless it's absolutely necessary
-- When edit history and surrounding code suggest different edits, prioritize the most recent edits in the history as they best reflect current intent.
-- When uncertain, predict only the minimal, high-confidence portion of the edit. Prefer a small, correct prediction over a large, speculative one.
-- Don't write a lot of code if you're not sure what to do
-- Do not delete or remove text that was just added in the edit history. If a recent edit introduces incomplete or incorrect code, finish or fix it in place, or simply output `NO_EDITS` rather than removing it. Only remove a recent edit if the history explicitly shows the user undoing it themselves.
-- Treat partial text at or near the cursor as the beginning of something the user is actively typing. Complete the code the user appears to be creating based on context.
-
-# Input Format
-
-You will be provided with:
-1. The user's *edit history*, in chronological order. Use this to infer the user's trajectory and predict the next most logical edit.
-2. A set of *related excerpts* from the user's codebase. Some of these may be needed for correctly predicting the next edit.
-   - `…` may appear within a related file to indicate that some code has been skipped.
-3. An excerpt from the user's *current file*.
-    - Within the user's current file, there is an *editable region* delimited by the `<|editable_region_start|>` and `<|editable_region_end|>` tags. You can only predict edits in this region.
-    - The `<|user_cursor|>` tag marks the user's current cursor position, as it stands after the last edit in the history.
-4. The *previous prediction* that was generated and needs improvement.
-5. *Quality feedback* explaining why the previous prediction was problematic.
-
-# Output Format
-
-- Briefly explain what was wrong with the previous prediction and how you'll improve it.
-- Output the entire editable region, applying the edits that you predict the user will make next.
-- If you're unsure about some portion of the next edit, you may still predict the surrounding code (such as a function definition, `for` loop, etc) and place the `<|user_cursor|>` within it for the user to fill in.
-- Wrap the edited code in a codeblock with exactly five backticks.
-- There are two special outputs for when you don't want to generate a new prediction. **These have different meanings — use the correct one:**
-
-  1. **`NO_EDITS`** โ€” The code is already complete and correct as-is. No edits should be made at all. The editable region should remain unchanged. Use this when:
-     - The code needs no modifications whatsoever
-     - Any prediction would revert or undo the user's intentional changes
-     - You are unsure what edit to make and prefer to do nothing
-
-     `````
-     NO_EDITS
-     `````
-
-  2. **`KEEP_PREVIOUS`** โ€” The previous prediction was actually correct and should be used as-is. Use this when:
-     - After reviewing the quality feedback, you determine the previous prediction is good
-     - You cannot find a meaningful improvement over the previous prediction
-     - The quality feedback was too cautious and the previous prediction correctly addresses the user's intent
-
-     `````
-     KEEP_PREVIOUS
-     `````
-
-  **Important:** `NO_EDITS` and `KEEP_PREVIOUS` are NOT interchangeable.
-  - `NO_EDITS` means "make zero changes to the code" (empty prediction).
-  - `KEEP_PREVIOUS` means "the previous prediction is correct, use it" (reuse the previous prediction).
-  - If you believe the previous prediction was correct, you MUST use `KEEP_PREVIOUS`, not `NO_EDITS`. Using `NO_EDITS` would discard the previous prediction entirely.
-
-# 1. User Edits History
+# Repair Request
+
+Your previous prediction has quality issues that need to be addressed. Please generate an improved prediction.
+
+## Quality Feedback
+
+{quality_feedback}
+
+## Your Previous Prediction (word-diff format)
 
 `````
-{edit_history}
+{actual_patch_word_diff}
 `````
 
-# 2. Related excerpts
+## Instructions
 
-{context}
+Generate an improved prediction following the same rules and output format from the original instructions. The key rules remain:
 
-# 3. Current File
+- **NEVER undo or revert the user's recent edits** — if a line was removed in the edit history, do NOT restore it
+- If your prediction would make the code more similar to what it was BEFORE the user's edit, output `NO_EDITS` instead
+- When uncertain, predict only the minimal, high-confidence portion of the edit
 
-{cursor_excerpt}
+## Output Format
 
-# 4. Previous Prediction (needs improvement)
+Follow the same output format as before, with one addition:
 
-The previous model generated the following edit (in word-diff format):
+- If the code is complete as-is and no edits should be made, output `NO_EDITS`
+- **NEW: If your previous prediction was actually correct** (the quality feedback was overly cautious), output `KEEP_PREVIOUS`:
 
-`````
-{actual_patch_word_diff}
-`````
+  `````
+  KEEP_PREVIOUS
+  `````
 
-# 5. Quality Feedback
+  Use `KEEP_PREVIOUS` when you determine the original prediction correctly addresses the user's intent despite the feedback.
 
-{quality_feedback}
+**Important:** `NO_EDITS` and `KEEP_PREVIOUS` are NOT interchangeable:
+- `NO_EDITS` = make zero changes to the code (discard the previous prediction)
+- `KEEP_PREVIOUS` = the previous prediction is correct, use it as-is
 
-# Your Improved Prediction
+## Your Improved Prediction
 
-Based on the feedback above, generate an improved prediction. Address the issues identified in the quality feedback. If the previous prediction was actually correct, output `KEEP_PREVIOUS`. If no edits should be made at all, output `NO_EDITS`.
+Briefly explain what was wrong with your previous prediction (or why it was actually correct), then provide the improved output.
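After the rework, only two placeholders remain in the template. A standalone sketch of the substitution (the real `build_repair_message` also extracts the word-diff and feedback from the `Example` first):

```rust
// Hypothetical standalone version of the placeholder substitution that
// the reworked repair.md template needs; only two placeholders remain
// after the edit history, context, and cursor excerpt were dropped.
fn fill_repair_template(
    template: &str,
    quality_feedback: &str,
    actual_patch_word_diff: &str,
) -> String {
    template
        .replace("{quality_feedback}", quality_feedback)
        .replace("{actual_patch_word_diff}", actual_patch_word_diff)
}
```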

crates/edit_prediction_cli/src/repair.rs

@@ -10,7 +10,7 @@ use crate::{
     BatchProvider, PredictionProvider,
     anthropic_client::AnthropicClient,
     example::{ActualCursor, Example, ExamplePrediction},
-    format_prompt::{TeacherPrompt, extract_cursor_excerpt_from_example, extract_last_codeblock},
+    format_prompt::{TeacherPrompt, extract_last_codeblock},
     openai_client::OpenAiClient,
     parse_output::run_parse_output,
     paths::LLM_CACHE_DB,
@@ -148,16 +148,15 @@ fn build_score_feedback(example: &Example) -> Option<String> {
     Some(feedback)
 }
 
-/// Build the repair prompt for an example that needs improvement.
-pub fn build_repair_prompt(example: &Example) -> Result<String> {
+/// Build the repair message (Turn 3) for a multi-turn conversation.
+///
+/// This message is sent after the original teacher prompt (Turn 1) and
+/// teacher response (Turn 2) to request an improved prediction.
+pub fn build_repair_message(example: &Example) -> Result<String> {
     let prediction = example
         .predictions
         .first()
         .context("no predictions available")?;
-    let prompt_inputs = example
-        .prompt_inputs
-        .as_ref()
-        .context("prompt_inputs missing (run context retrieval first)")?;
     let actual_patch = prediction
         .actual_patch
         .as_ref()
@@ -169,35 +168,8 @@ pub fn build_repair_prompt(example: &Example) -> Result<String> {
 
     let actual_patch_word_diff = unified_to_word_diff(actual_patch);
 
-    let mut edit_history = String::new();
-    for event in &prompt_inputs.edit_history {
-        match event.as_ref() {
-            zeta_prompt::Event::BufferChange {
-                path,
-                old_path,
-                diff,
-                predicted: _,
-                in_open_source_repo: _,
-            } => {
-                edit_history.push_str(&format!("--- a{}\n", old_path.display()));
-                edit_history.push_str(&format!("+++ b{}\n", path.display()));
-                let diff_word_diff = unified_to_word_diff(diff);
-                edit_history.push_str(&diff_word_diff);
-                edit_history.push_str("\n\n");
-            }
-        }
-    }
-
-    let context = TeacherPrompt::format_context(example);
-
-    let cursor_excerpt =
-        extract_cursor_excerpt_from_example(example).context("failed to extract cursor excerpt")?;
-
     let prompt_template = crate::prompt_assets::get_prompt("repair.md");
     Ok(prompt_template
-        .replace("{edit_history}", &edit_history)
-        .replace("{context}", &context)
-        .replace("{cursor_excerpt}", &cursor_excerpt)
         .replace("{actual_patch_word_diff}", &actual_patch_word_diff)
         .replace("{quality_feedback}", &quality_feedback))
 }
@@ -266,6 +238,12 @@ static OPENAI_CLIENT_BATCH: OnceLock<OpenAiClient> = OnceLock::new();
 static OPENAI_CLIENT_PLAIN: OnceLock<OpenAiClient> = OnceLock::new();
 
 /// Run repair for a single example.
+///
+/// This sends a multi-turn conversation to the LLM:
+/// - Turn 1 (User): Original teacher prompt
+/// - Turn 2 (Assistant): Original teacher response
+/// - Turn 3 (User): Repair critique and instructions
+/// - Turn 4 (Assistant): Improved prediction (the response we parse)
 pub async fn run_repair(
     example: &mut Example,
     args: &RepairArgs,
@@ -289,10 +267,20 @@ pub async fn run_repair(
         anyhow::bail!("no predictions available (run predict first)");
     }
 
+    let teacher_prompt = example
+        .prompt
+        .as_ref()
+        .context("prompt missing (run format_prompt first)")?;
+
+    let teacher_response = &example.predictions[0].actual_output;
+    if teacher_response.is_empty() {
+        anyhow::bail!("teacher response is empty (run predict first)");
+    }
+
     let step_progress = example_progress.start(Step::Repair);
 
     let model = model_for_backend(args.backend);
-    let prompt = build_repair_prompt(example).context("Failed to build repair prompt")?;
+    let repair_message = build_repair_message(example).context("Failed to build repair message")?;
 
     step_progress.set_substatus("generating");
 
@@ -309,13 +297,32 @@ pub async fn run_repair(
                 })
             };
 
-            let messages = vec![anthropic::Message {
-                role: anthropic::Role::User,
-                content: vec![anthropic::RequestContent::Text {
-                    text: prompt,
-                    cache_control: None,
-                }],
-            }];
+            let messages = vec![
+                // Turn 1: Original teacher prompt
+                anthropic::Message {
+                    role: anthropic::Role::User,
+                    content: vec![anthropic::RequestContent::Text {
+                        text: teacher_prompt.input.clone(),
+                        cache_control: None,
+                    }],
+                },
+                // Turn 2: Original teacher response
+                anthropic::Message {
+                    role: anthropic::Role::Assistant,
+                    content: vec![anthropic::RequestContent::Text {
+                        text: teacher_response.clone(),
+                        cache_control: None,
+                    }],
+                },
+                // Turn 3: Repair critique and instructions
+                anthropic::Message {
+                    role: anthropic::Role::User,
+                    content: vec![anthropic::RequestContent::Text {
+                        text: repair_message,
+                        cache_control: None,
+                    }],
+                },
+            ];
 
             let Some(response) = client.generate(model, 16384, messages, None, false).await? else {
                 return Ok(());
@@ -341,9 +348,21 @@ pub async fn run_repair(
                 })
             };
 
-            let messages = vec![open_ai::RequestMessage::User {
-                content: open_ai::MessageContent::Plain(prompt),
-            }];
+            let messages = vec![
+                // Turn 1: Original teacher prompt
+                open_ai::RequestMessage::User {
+                    content: open_ai::MessageContent::Plain(teacher_prompt.input.clone()),
+                },
+                // Turn 2: Original teacher response
+                open_ai::RequestMessage::Assistant {
+                    content: Some(open_ai::MessageContent::Plain(teacher_response.clone())),
+                    tool_calls: vec![],
+                },
+                // Turn 3: Repair critique and instructions
+                open_ai::RequestMessage::User {
+                    content: open_ai::MessageContent::Plain(repair_message),
+                },
+            ];
 
             let Some(response) = client.generate(model, 16384, messages, None, false).await? else {
                 return Ok(());