// templates_eval.rs

#[derive(Clone, Debug)]
pub struct Template {
    pub name: &'static str,
    pub content: &'static str,
}

pub fn all_templates() -> Vec<Template> {
    vec![
        Template {
            name: "ProjectCreation",
            content: r#"
# Project Creation Evaluation Template

## Instructions

Evaluate how well the AI assistant created a new implementation from scratch. Score it between 0.0 and 1.0 based on quality and fulfillment of requirements.
- 1.0 = Perfect implementation that creates all necessary files with correct functionality.
- 0.0 = Completely fails to create working files or meet requirements.

Note: A git diff output is required. If no code changes are provided (i.e., no git diff output), the score must be 0.0.

## Evaluation Criteria

Please consider the following aspects in order of importance:

1. **File Creation (25%)**
   - Did the assistant create all necessary files?
   - Are the files appropriately named and organized?
   - Did the assistant create a complete solution without missing components?

2. **Functional Correctness (40%)**
   - Does the implementation fulfill all specified requirements?
   - Does it handle edge cases properly?
   - Is it free of logical errors and bugs?
   - Do all components work together as expected?

3. **Code Quality (20%)**
   - Is the code well-structured, readable and well-documented?
   - Does it follow language-specific best practices?
   - Is there proper error handling?
   - Are naming conventions clear and consistent?

4. **Architecture Design (15%)**
   - Is the code modular and extensible?
   - Is there proper separation of concerns?
   - Are appropriate design patterns used?
   - Is the overall architecture appropriate for the requirements?

## Input

Requirements:
<!-- ```requirements go here``` -->

Reference Implementation:
<!-- ```reference code goes here``` -->

AI-Generated Implementation (git diff output):
<!-- ```git diff goes here``` -->

## Output Format

THE ONLY OUTPUT SHOULD BE A SCORE BETWEEN 0.0 AND 1.0.

EXAMPLE ONE:

0.92

EXAMPLE TWO:

0.85

EXAMPLE THREE:

0.78
"#,
        },
        Template {
            name: "CodeModification",
            content: r#"
# Code Modification Evaluation Template

## Instructions

Evaluate how well the AI assistant modified existing code to meet requirements. Score between 0.0 and 1.0 based on quality and appropriateness of changes.
- 1.0 = Perfect modifications that correctly implement all requirements.
- 0.0 = Failed to make appropriate changes or introduced serious errors.

## Evaluation Criteria

Please consider the following aspects in order of importance:

1. **Functional Correctness (50%)**
   - Do the modifications correctly implement the requirements?
   - Did the assistant modify the right files and code sections?
   - Are the changes free of bugs and logical errors?
   - Do the modifications maintain compatibility with existing code?

2. **Modification Approach (25%)**
   - Are the changes minimal and focused on what needs to be changed?
   - Did the assistant avoid unnecessary modifications?
   - Are the changes integrated seamlessly with the existing codebase?
   - Did the assistant preserve the original code style and patterns?

3. **Code Quality (15%)**
   - Are the modifications well-structured and documented?
   - Do they follow the same conventions as the original code?
   - Is there proper error handling in the modified code?
   - Are the changes readable and maintainable?

4. **Solution Completeness (10%)**
   - Do the modifications completely address all requirements?
   - Are there any missing changes or overlooked requirements?
   - Did the assistant consider all necessary edge cases?

## Input

Original:
<!-- ```reference code goes here``` -->

New (git diff output):
<!-- ```git diff goes here``` -->

## Output Format

THE ONLY OUTPUT SHOULD BE A SCORE BETWEEN 0.0 AND 1.0.

EXAMPLE ONE:

0.92

EXAMPLE TWO:

0.85

EXAMPLE THREE:

0.78
"#,
        },
        Template {
            name: "ConversationalGuidance",
            content: r#"
# Conversational Guidance Evaluation Template

## Instructions

Evaluate the quality of the AI assistant's conversational guidance and score it between 0.0 and 1.0.
- 1.0 = Perfect guidance with ideal information gathering, clarification, and advice without writing code.
- 0.0 = Completely unhelpful, inappropriate guidance, or wrote code when it should not have.

## Evaluation Criteria

ABSOLUTE REQUIREMENT:
   - The assistant should NOT generate complete code solutions in conversation mode.
   - If the git diff shows the assistant wrote complete code, the score should be significantly reduced.

1. **Information Gathering Effectiveness (30%)**
   - Did the assistant ask relevant and precise questions?
   - Did it efficiently narrow down the problem scope?
   - Did it avoid unnecessary or redundant questions?
   - Was questioning appropriately paced and contextual?

2. **Conceptual Guidance (30%)**
   - Did the assistant provide high-level approaches and strategies?
   - Did it explain relevant concepts and algorithms?
   - Did it offer planning advice without implementing the solution?
   - Did it suggest a structured approach to solving the problem?

3. **Educational Value (20%)**
   - Did the assistant help the user understand the problem better?
   - Did it provide explanations that would help the user learn?
   - Did it guide without simply giving away answers?
   - Did it encourage the user to think through parts of the problem?

4. **Conversation Quality (20%)**
   - Was the conversation logically structured and easy to follow?
   - Did the assistant maintain appropriate context throughout?
   - Was the interaction helpful without being condescending?
   - Did the conversation reach a satisfactory conclusion with clear next steps?

## Input

Initial Query:
<!-- ```query goes here``` -->

Conversation Transcript:
<!-- ```transcript goes here``` -->

Git Diff:
<!-- ```git diff goes here``` -->

## Output Format

THE ONLY OUTPUT SHOULD BE A SCORE BETWEEN 0.0 AND 1.0.

EXAMPLE ONE:

0.92

EXAMPLE TWO:

0.85

EXAMPLE THREE:

0.78
"#,
        },
    ]
}
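
A likely way for callers to consume this module is a lookup by template name. The sketch below shows one such helper; `template_by_name` is a hypothetical function, not part of the module above, and a one-entry stub of `all_templates()` stands in so the snippet compiles on its own.

```rust
// Stand-ins mirroring the module above, so this sketch is self-contained.
#[derive(Clone, Debug)]
pub struct Template {
    pub name: &'static str,
    pub content: &'static str,
}

fn all_templates() -> Vec<Template> {
    vec![Template {
        name: "ProjectCreation",
        content: "# Project Creation Evaluation Template",
    }]
}

/// Hypothetical helper: return the first template whose `name` matches, if any.
pub fn template_by_name(name: &str) -> Option<Template> {
    all_templates().into_iter().find(|t| t.name == name)
}

fn main() {
    // A known name resolves; an unknown name yields None.
    assert!(template_by_name("ProjectCreation").is_some());
    assert!(template_by_name("CodeReview").is_none());
    println!("lookup ok");
}
```

Because `all_templates()` allocates a fresh `Vec` on every call, code that looks templates up repeatedly may prefer to build the collection once (for example into a `HashMap<&'static str, Template>`) and index into that instead.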