How to Write a Good Evaluator?
Last updated: 2025-11-12
What Is an Evaluator
An Evaluator is an evaluation function that quantifies how "good" a candidate solution is. During evolutionary optimization, the system generates a large number of candidate solutions; each one is passed to the evaluator and scored, and the system uses these scores to select strong candidates and iterate toward better ones.
Why It Matters
- Objective definition: it is the "translator" of your problem, turning a business goal into a mathematical objective.
- Evolution guidance: it sets the direction of the optimization process and shapes what the search finally converges to.
- Highly customizable: different business scenarios need different evaluators, because every task has its own optimization direction and goals.
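Concretely, the platform only needs a callable that takes the path of a candidate program and returns a dictionary of scores. Here is a minimal sketch of that contract; the fixed values are purely illustrative, and the field names match the full template in the next section:
Python
def evaluate(program_path: str) -> dict:
    """Score one candidate program; a higher combined_score means a better solution."""
    # ... run the candidate program and measure it (see the full template below) ...
    return {
        "validity": 1.0,          # illustrative values only
        "combined_score": 0.75,
        "eval_time": 1.2,
        "performance": 0.75,
    }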
Reference Template
If you are still unsure how to design an evaluator, you can start from the following evaluator template for the circle packing problem and adapt it to your task:
Python
1"""
2Evaluator skeleton for code evaluation tasks
3"""
4
5import json
6import time
7import os
8import sys
9import subprocess
10import tempfile
11import pickle
12import traceback
13from pathlib import Path
14
15
16def validate_constraints(result) -> bool:
17 """
18 Judge whether the result meets the constraints.
19
20 Args:
21 result: The output from user's code execution
22
23 Returns:
24 True if valid, False otherwise
25 """
26 # TODO: Implement constraint validation logic
27 # Example checks:
28 # - Check output format and types
29 # - Verify solution meets problem requirements
30 # - Validate numerical constraints
31 return True
32
33
34def run_with_timeout(program_path: str, timeout_seconds: int):
35 """
36 Run the program in a separate process with timeout.
37
38 Args:
39 program_path: Path to the program file
40 timeout_seconds: Maximum execution time in seconds
41
42 Returns:
43 The result of program execution
44 """
45 # Create temporary execution environment
46 with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as temp_file:
47 # TODO: Customize the execution script for your specific task
48 script = """
49# Template for subprocess execution
50import sys
51import pickle
52import traceback
53
54try:
55 # Import user program
56 # TODO: Adapt import mechanism for your task
57 spec = __import__('importlib.util').util.spec_from_file_location("program", '{program_path}')
58 program = __import__('importlib.util').util.module_from_spec(spec)
59 spec.loader.exec_module(program)
60
61 # Execute main function
62 # TODO: Replace with your task's entry function
63 result = program.main_function()
64
65 # Save results
66 with open('{results_path}', 'wb') as f:
67 pickle.dump({{'result': result}}, f)
68
69except Exception as e:
70 # Save error information
71 with open('{results_path}', 'wb') as f:
72 pickle.dump({{'error': str(e), 'traceback': traceback.format_exc()}}, f)
73 sys.exit(0)
74"""
75 temp_file.write(script.encode())
76 temp_file_path = temp_file.name
77
78 results_path = f"{temp_file_path}.results"
79
80 try:
81 # Execute with timeout
82 process = subprocess.Popen(
83 [sys.executable, temp_file_path],
84 stdout=subprocess.PIPE,
85 stderr=subprocess.PIPE
86 )
87
88 stdout, stderr = process.communicate(timeout=timeout_seconds)
89 exit_code = process.returncode
90
91 # Process results
92 if os.path.exists(results_path):
93 with open(results_path, "rb") as f:
94 results = pickle.load(f)
95
96 if "error" in results:
97 raise RuntimeError(f"Execution failed: {results['error']}")
98 return results["result"]
99
100 # Handle missing results
101 if exit_code != 0:
102 raise RuntimeError(f"Process failed with exit code {exit_code}")
103 else:
104 raise RuntimeError("Results file not found")
105
106 except subprocess.TimeoutExpired:
107 process.kill()
108 process.wait()
109 raise TimeoutError(f"Timeout after {timeout_seconds} seconds")
110
111 finally:
112 # Cleanup
113 for path in [temp_file_path, results_path]:
114 if os.path.exists(path):
115 os.unlink(path)
116
117
118def calculate_metrics(result, eval_time: float) -> dict:
119 """
120 Calculate evaluation metrics based on the result.
121
122 Args:
123 result: The output from user's code execution
124 eval_time: Execution time in seconds
125
126 Returns:
127 Dictionary of metrics (must include validity and combined_score)
128 """
129 # TODO: Implement your specific metric calculations
130
131 # Validate the result
132 validity = 1.0 if validate_constraints(result) else 0.0
133
134 # TODO: Calculate task-specific metrics
135 performance_score = 0.0 # Replace with actual performance measurement
136
137 # Combined score - customize based on your task requirements
138 combined_score = performance_score * validity
139
140 return {
141 "validity": float(validity),
142 "combined_score": float(combined_score),
143 "eval_time": float(eval_time),
144 # TODO: Add your task-specific metrics
145 "performance": float(performance_score),
146 }
147
148
149def evaluate(program_path: str) -> dict:
150 """
151 Main evaluation function.
152
153 Args:
154 program_path: Path to the user's program file
155
156 Returns:
157 Dictionary containing all evaluation metrics
158 """
159 try:
160 # Execute user program with timeout
161 start_time = time.time()
162 result = run_with_timeout(program_path, timeout_seconds=600)
163 end_time = time.time()
164 eval_time = end_time - start_time
165
166 # Calculate metrics
167 metrics = calculate_metrics(result, eval_time)
168 return metrics
169
170 except Exception as e:
171 # Catch any exceptions and return default scores
172 print(f"Evaluation failed: {str(e)}")
173 traceback.print_exc()
174
175 return {
176 "validity": 0.0,
177 "combined_score": 0.0,
178 "eval_time": 0.0,
179 "performance": 0.0,
180 "error_info": {"run_error": str(e)}
181 }
182
183
184if __name__ == "__main__":
185 if len(sys.argv) != 2:
186 print("Usage: python evaluator.py <path_to_program_or_json>")
187 sys.exit(1)
188
189 input_path = sys.argv[1]
190
191 # Handle different input types
192 if input_path.endswith('.json'):
193 # Extract code from JSON
194 with open(input_path, 'r', encoding='utf-8') as f:
195 data = json.load(f)
196 code = data["code"]
197
198 # Write to temporary file
199 temp_program = "temp_user_program.py"
200 with open(temp_program, 'w', encoding='utf-8') as f:
201 f.write(code)
202 program_to_evaluate = temp_program
203 else:
204 # Direct Python file
205 program_to_evaluate = input_path
206
207 # Run evaluation
208 result = evaluate(program_to_evaluate)
209
210 # Cleanup
211 if input_path.endswith('.json') and os.path.exists(temp_program):
212 os.unlink(temp_program)
213
214 print("--- EVALUATION RESULT ---")
215 print(json.dumps(result, indent=2))
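Once saved, the template can be driven from the command line as shown in the __main__ block, or programmatically. A minimal usage sketch, assuming the template above is saved as evaluator.py and a candidate exists at candidate_program.py (both names are hypothetical):
Python
# Call the evaluator directly from Python instead of the CLI
from evaluator import evaluate

metrics = evaluate("candidate_program.py")
print(metrics["combined_score"])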
A complete Evaluator consists of four core parts:
- Constraint validation (validate_constraints): checks whether a candidate solution satisfies the task rules.
Example:
Python
def validate_constraints(result):
    # Correct length, correct format, values within the legal range, etc.
    if not isinstance(result, list):
        return False
    if len(result) != 16:
        return False
    return True
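For the circle packing task this template targets, the check would typically be stricter. A hedged sketch, assuming the candidate returns a list of (x, y, r) tuples for circles that must fit inside the unit square without overlapping (N_CIRCLES is a hypothetical instance size):
Python
import itertools

def validate_constraints(result):
    N_CIRCLES = 26  # hypothetical instance size; adjust to your task
    if not isinstance(result, list) or len(result) != N_CIRCLES:
        return False
    for circle in result:
        if not (isinstance(circle, (tuple, list)) and len(circle) == 3):
            return False
        x, y, r = circle
        # Every circle must lie fully inside the unit square
        if r <= 0 or x - r < 0 or x + r > 1 or y - r < 0 or y + r > 1:
            return False
    # No two circles may overlap (with a small numerical tolerance)
    for (x1, y1, r1), (x2, y2, r2) in itertools.combinations(result, 2):
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 < (r1 + r2) ** 2 - 1e-9:
            return False
    return True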
- Safe execution (run_with_timeout): the platform runs candidate programs in an isolated environment under a uniform execution time limit (600 seconds by default, adjustable per task).
The Evaluator automatically: runs the user code in a subprocess, terminates it when the timeout is reached, and captures errors and output (see the compact sketch below).
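The full pickle-based implementation is in the template above. If your candidate can simply print its result as JSON, the same three behaviors can be sketched more compactly; run_candidate and the JSON convention here are assumptions, not part of the platform API:
Python
import json
import subprocess
import sys

def run_candidate(program_path, timeout_seconds=600):
    """Run a candidate in a subprocess, kill it on timeout, and capture errors/output."""
    try:
        proc = subprocess.run(
            [sys.executable, program_path],
            capture_output=True, text=True, timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        raise TimeoutError(f"Timeout after {timeout_seconds} seconds")
    if proc.returncode != 0:
        raise RuntimeError(f"Candidate failed:\n{proc.stderr}")
    return json.loads(proc.stdout)  # assumes the candidate prints its result as JSON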
- Metric calculation (calculate_metrics): computes the final score from the candidate's output. It must return a dictionary containing four key fields:
Python
return {
    "validity": float(validity),     # 0.0 or 1.0: whether the solution is valid
    "combined_score": float(score),  # overall score; the system ranks candidates by this
    "eval_time": float(eval_time),   # execution time in seconds
    "performance": float(perf)       # task-specific performance metric
}
Typically combined_score lies in [0, 1], any invalid result should be scored as 0.0, and validity must be factored in: combined_score = performance * validity (see the circle packing sketch below).
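For the circle packing template, one hedged way to turn this rule into code is to use the sum of radii as the performance measure and normalize it against a reference value. TARGET_SUM below is a hypothetical constant, and the (x, y, r) format reuses the validate_constraints sketch above:
Python
def calculate_metrics(result, eval_time):
    validity = 1.0 if validate_constraints(result) else 0.0
    # Sum of radii is a natural performance measure for circle packing;
    # TARGET_SUM is a hypothetical reference used only to map scores into [0, 1].
    TARGET_SUM = 2.635
    sum_of_radii = sum(r for _, _, r in result) if validity else 0.0
    performance = min(sum_of_radii / TARGET_SUM, 1.0)
    combined_score = performance * validity
    return {
        "validity": validity,
        "combined_score": float(combined_score),
        "eval_time": float(eval_time),
        "performance": float(performance),
        "sum_of_radii": float(sum_of_radii),  # extra task-specific detail is always welcome
    }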
- Main entry point (evaluate): wraps execution, metric calculation, and error handling into a single interface that the platform calls:
Python
def evaluate(program_path):
    start_time = time.time()
    result = run_with_timeout(program_path, timeout_seconds=600)
    eval_time = time.time() - start_time
    # ... error handling omitted here; see the full template above ...
    metrics = calculate_metrics(result, eval_time)
    return metrics
Tips for designing an Evaluator:
- Always return the standardized score dictionary, and make sure a higher combined_score means a better solution (the system optimizes toward higher scores).
Python
return {
    "validity": float(validity),     # 0.0 or 1.0: whether the solution is valid
    "combined_score": float(score),  # overall score; the system ranks candidates by this
    "eval_time": float(eval_time),   # execution time in seconds
    "performance": float(perf)       # task-specific performance metric
}
The following fields are required and none may be omitted:
validity, combined_score, eval_time
They are the foundation of the system's ranking, aggregation, and evolution. Beyond that, the more detail you return the better; in particular, if code execution fails, include the error message in as much detail as possible (see the sketch below).
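For instance, the except branch of evaluate() can carry the full traceback alongside the default scores. A sketch using a hypothetical failure_metrics helper:
Python
import traceback

def failure_metrics(exc: Exception) -> dict:
    """Build the metrics dict for a failed evaluation, keeping the full traceback."""
    # Call this from inside the except block so format_exc() captures the active exception.
    return {
        "validity": 0.0,
        "combined_score": 0.0,
        "eval_time": 0.0,
        "performance": 0.0,
        "error_info": {
            "run_error": str(exc),
            "traceback": traceback.format_exc(),  # detailed errors help later iterations
        },
    }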
- Candidate programs must define the default entry function
The platform runs the following function from the candidate solution:
Python
def main_function():
    return result
Submitted code must therefore define this function, or the Evaluator will raise an error. If you prefer a different function name, change it consistently throughout the template. A minimal candidate is sketched below.
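For reference, a trivial candidate program for the circle packing setup above might look like this; the (x, y, r) format is the same assumption used in the earlier sketches, and the placement is deliberately naive:
Python
# candidate_program.py -- a minimal, deliberately naive candidate
def main_function():
    # 26 small, non-overlapping circles laid out on a grid inside the unit square
    circles = []
    for i in range(26):
        x = 0.1 + 0.15 * (i % 6)
        y = 0.1 + 0.18 * (i // 6)
        circles.append((x, y, 0.05))
    return circles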
- Evaluation must finish within the system limit (600 seconds by default). A timeout makes the current evaluation invalid and returns an error message.
