How to Write a Good Evaluator?
Last updated: 2025-11-12
What Is an Evaluator
An Evaluator is an evaluation function that quantifies how "good" a candidate solution is. During evolutionary optimization, the system generates a large number of candidate solutions; each one is passed to the evaluator and scored, and the system uses these scores to select strong candidates and iterate toward better ones.
Why It Matters
- Objective definition: it is the "translator" of your problem, turning a business goal into a mathematical objective.
- Evolution guidance: it sets the direction of the optimization process and shapes what the search finally converges to.
- Highly customizable: different business scenarios need different evaluators, because every task has its own optimization direction and goals.
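Concretely, the platform only needs a callable that takes the path of a candidate program and returns a dictionary of scores. Here is a minimal sketch of that contract; the fixed values are purely illustrative, and the field names match the full template in the next section:
Python
def evaluate(program_path: str) -> dict:
    """Score one candidate program; a higher combined_score means a better solution."""
    # ... run the candidate program and measure it (see the full template below) ...
    return {
        "validity": 1.0,          # illustrative values only
        "combined_score": 0.75,
        "eval_time": 1.2,
        "performance": 0.75,
    }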
Reference Template
If you are still unsure how to design an evaluator, you can start from the following evaluator template for the circle packing problem and adapt it to your task:
Python
1"""
2Evaluator skeleton for code evaluation tasks
3"""
4
5import json
6import time
7import os
8import sys
9import subprocess
10import tempfile
11import pickle
12import traceback
13from pathlib import Path
14
15
16def validate_constraints(result) -> bool:
17 """
18 Judge whether the result meets the constraints.
19
20 Args:
21 result: The output from user's code execution
22
23 Returns:
24 True if valid, False otherwise
25 """
26 # TODO: Implement constraint validation logic
27 # Example checks:
28 # - Check output format and types
29 # - Verify solution meets problem requirements
30 # - Validate numerical constraints
31 return True
32
33
34def run_with_timeout(program_path: str, timeout_seconds: int):
35 """
36 Run the program in a separate process with timeout.
37
38 Args:
39 program_path: Path to the program file
40 timeout_seconds: Maximum execution time in seconds
41
42 Returns:
43 The result of program execution
44 """
45 # Create temporary execution environment
46 with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as temp_file:
47 # TODO: Customize the execution script for your specific task
48 script = """
49# Template for subprocess execution
50import sys
51import pickle
52import traceback
53
54try:
55 # Import user program
56 # TODO: Adapt import mechanism for your task
57 spec = __import__('importlib.util').util.spec_from_file_location("program", '{program_path}')
58 program = __import__('importlib.util').util.module_from_spec(spec)
59 spec.loader.exec_module(program)
60
61 # Execute main function
62 # TODO: Replace with your task's entry function
63 result = program.main_function()
64
65 # Save results
66 with open('{results_path}', 'wb') as f:
67 pickle.dump({{'result': result}}, f)
68
69except Exception as e:
70 # Save error information
71 with open('{results_path}', 'wb') as f:
72 pickle.dump({{'error': str(e), 'traceback': traceback.format_exc()}}, f)
73 sys.exit(0)
74"""
75 temp_file.write(script.encode())
76 temp_file_path = temp_file.name
77
78 results_path = f"{temp_file_path}.results"
79
80 try:
81 # Execute with timeout
82 process = subprocess.Popen(
83 [sys.executable, temp_file_path],
84 stdout=subprocess.PIPE,
85 stderr=subprocess.PIPE
86 )
87
88 stdout, stderr = process.communicate(timeout=timeout_seconds)
89 exit_code = process.returncode
90
91 # Process results
92 if os.path.exists(results_path):
93 with open(results_path, "rb") as f:
94 results = pickle.load(f)
95
96 if "error" in results:
97 raise RuntimeError(f"Execution failed: {results['error']}")
98 return results["result"]
99
100 # Handle missing results
101 if exit_code != 0:
102 raise RuntimeError(f"Process failed with exit code {exit_code}")
103 else:
104 raise RuntimeError("Results file not found")
105
106 except subprocess.TimeoutExpired:
107 process.kill()
108 process.wait()
109 raise TimeoutError(f"Timeout after {timeout_seconds} seconds")
110
111 finally:
112 # Cleanup
113 for path in [temp_file_path, results_path]:
114 if os.path.exists(path):
115 os.unlink(path)
116
117
118def calculate_metrics(result, eval_time: float) -> dict:
119 """
120 Calculate evaluation metrics based on the result.
121
122 Args:
123 result: The output from user's code execution
124 eval_time: Execution time in seconds
125
126 Returns:
127 Dictionary of metrics (must include validity and combined_score)
128 """
129 # TODO: Implement your specific metric calculations
130
131 # Validate the result
132 validity = 1.0 if validate_constraints(result) else 0.0
133
134 # TODO: Calculate task-specific metrics
135 performance_score = 0.0 # Replace with actual performance measurement
136
137 # Combined score - customize based on your task requirements
138 combined_score = performance_score * validity
139
140 return {
141 "validity": float(validity),
142 "combined_score": float(combined_score),
143 "eval_time": float(eval_time),
144 # TODO: Add your task-specific metrics
145 "performance": float(performance_score),
146 }
147
148
149def evaluate(program_path: str) -> dict:
150 """
151 Main evaluation function.
152
153 Args:
154 program_path: Path to the user's program file
155
156 Returns:
157 Dictionary containing all evaluation metrics
158 """
159 try:
160 # Execute user program with timeout
161 start_time = time.time()
162 result = run_with_timeout(program_path, timeout_seconds=600)
163 end_time = time.time()
164 eval_time = end_time - start_time
165
166 # Calculate metrics
167 metrics = calculate_metrics(result, eval_time)
168 return metrics
169
170 except Exception as e:
171 # Catch any exceptions and return default scores
172 print(f"Evaluation failed: {str(e)}")
173 traceback.print_exc()
174
175 return {
176 "validity": 0.0,
177 "combined_score": 0.0,
178 "eval_time": 0.0,
179 "performance": 0.0,
180 "error_info": {"run_error": str(e)}
181 }
182
183
184if __name__ == "__main__":
185 if len(sys.argv) != 2:
186 print("Usage: python evaluator.py <path_to_program_or_json>")
187 sys.exit(1)
188
189 input_path = sys.argv[1]
190
191 # Handle different input types
192 if input_path.endswith('.json'):
193 # Extract code from JSON
194 with open(input_path, 'r', encoding='utf-8') as f:
195 data = json.load(f)
196 code = data["code"]
197
198 # Write to temporary file
199 temp_program = "temp_user_program.py"
200 with open(temp_program, 'w', encoding='utf-8') as f:
201 f.write(code)
202 program_to_evaluate = temp_program
203 else:
204 # Direct Python file
205 program_to_evaluate = input_path
206
207 # Run evaluation
208 result = evaluate(program_to_evaluate)
209
210 # Cleanup
211 if input_path.endswith('.json') and os.path.exists(temp_program):
212 os.unlink(temp_program)
213
214 print("--- EVALUATION RESULT ---")
215 print(json.dumps(result, indent=2))
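Once saved, the template can be driven from the command line as shown in the __main__ block, or programmatically. A minimal usage sketch, assuming the template above is saved as evaluator.py and a candidate exists at candidate_program.py (both names are hypothetical):
Python
# Call the evaluator directly from Python instead of the CLI
from evaluator import evaluate

metrics = evaluate("candidate_program.py")
print(metrics["combined_score"])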
A complete Evaluator consists of four core parts:
- Constraint validation (validate_constraints): checks whether a candidate solution satisfies the task rules.
Example:
Python
def validate_constraints(result):
    # Correct length, correct format, values within the legal range, etc.
    if not isinstance(result, list):
        return False
    if len(result) != 16:
        return False
    return True
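For the circle packing task this template targets, the check would typically be stricter. A hedged sketch, assuming the candidate returns a list of (x, y, r) tuples for circles that must fit inside the unit square without overlapping (N_CIRCLES is a hypothetical instance size):
Python
import itertools

def validate_constraints(result):
    N_CIRCLES = 26  # hypothetical instance size; adjust to your task
    if not isinstance(result, list) or len(result) != N_CIRCLES:
        return False
    for circle in result:
        if not (isinstance(circle, (tuple, list)) and len(circle) == 3):
            return False
        x, y, r = circle
        # Every circle must lie fully inside the unit square
        if r <= 0 or x - r < 0 or x + r > 1 or y - r < 0 or y + r > 1:
            return False
    # No two circles may overlap (with a small numerical tolerance)
    for (x1, y1, r1), (x2, y2, r2) in itertools.combinations(result, 2):
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 < (r1 + r2) ** 2 - 1e-9:
            return False
    return True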
- Safe execution (run_with_timeout): the platform runs candidate programs in an isolated environment under a uniform execution time limit (600 seconds by default, adjustable per task).
The Evaluator automatically: runs the user code in a subprocess, terminates it when the timeout is reached, and captures errors and output (see the compact sketch below).
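The full pickle-based implementation is in the template above. If your candidate can simply print its result as JSON, the same three behaviors can be sketched more compactly; run_candidate and the JSON convention here are assumptions, not part of the platform API:
Python
import json
import subprocess
import sys

def run_candidate(program_path, timeout_seconds=600):
    """Run a candidate in a subprocess, kill it on timeout, and capture errors/output."""
    try:
        proc = subprocess.run(
            [sys.executable, program_path],
            capture_output=True, text=True, timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        raise TimeoutError(f"Timeout after {timeout_seconds} seconds")
    if proc.returncode != 0:
        raise RuntimeError(f"Candidate failed:\n{proc.stderr}")
    return json.loads(proc.stdout)  # assumes the candidate prints its result as JSON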
- Metric calculation (calculate_metrics): computes the final score from the candidate's output. It must return a dictionary containing four key fields:
Python
return {
    "validity": float(validity),     # 0.0 or 1.0: whether the solution is valid
    "combined_score": float(score),  # overall score; the system ranks candidates by this
    "eval_time": float(eval_time),   # execution time in seconds
    "performance": float(perf)       # task-specific performance metric
}
Typically combined_score lies in [0, 1], any invalid result should be scored as 0.0, and validity must be factored in: combined_score = performance * validity (see the circle packing sketch below).
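For the circle packing template, one hedged way to turn this rule into code is to use the sum of radii as the performance measure and normalize it against a reference value. TARGET_SUM below is a hypothetical constant, and the (x, y, r) format reuses the validate_constraints sketch above:
Python
def calculate_metrics(result, eval_time):
    validity = 1.0 if validate_constraints(result) else 0.0
    # Sum of radii is a natural performance measure for circle packing;
    # TARGET_SUM is a hypothetical reference used only to map scores into [0, 1].
    TARGET_SUM = 2.635
    sum_of_radii = sum(r for _, _, r in result) if validity else 0.0
    performance = min(sum_of_radii / TARGET_SUM, 1.0)
    combined_score = performance * validity
    return {
        "validity": validity,
        "combined_score": float(combined_score),
        "eval_time": float(eval_time),
        "performance": float(performance),
        "sum_of_radii": float(sum_of_radii),  # extra task-specific detail is always welcome
    }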
- Main entry point (evaluate): wraps execution, metric calculation, and error handling into a single interface that the platform calls:
Python
def evaluate(program_path):
    start_time = time.time()
    result = run_with_timeout(program_path, timeout_seconds=600)
    eval_time = time.time() - start_time
    # ... error handling omitted here; see the full template above ...
    metrics = calculate_metrics(result, eval_time)
    return metrics
Tips for designing an Evaluator:
- Always return the standardized score dictionary, and make sure a higher combined_score means a better solution (the system optimizes toward higher scores).
Python
return {
    "validity": float(validity),     # 0.0 or 1.0: whether the solution is valid
    "combined_score": float(score),  # overall score; the system ranks candidates by this
    "eval_time": float(eval_time),   # execution time in seconds
    "performance": float(perf)       # task-specific performance metric
}
The following fields are required and none may be omitted:
validity, combined_score, eval_time
They are the foundation of the system's ranking, aggregation, and evolution. Beyond that, the more detail you return the better; in particular, if code execution fails, include the error message in as much detail as possible (see the sketch below).
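For instance, the except branch of evaluate() can carry the full traceback alongside the default scores. A sketch using a hypothetical failure_metrics helper:
Python
import traceback

def failure_metrics(exc: Exception) -> dict:
    """Build the metrics dict for a failed evaluation, keeping the full traceback."""
    # Call this from inside the except block so format_exc() captures the active exception.
    return {
        "validity": 0.0,
        "combined_score": 0.0,
        "eval_time": 0.0,
        "performance": 0.0,
        "error_info": {
            "run_error": str(exc),
            "traceback": traceback.format_exc(),  # detailed errors help later iterations
        },
    }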
- Candidate programs must define the default entry function
The platform runs the following function from the candidate solution:
Python
def main_function():
    return result
Submitted code must therefore define this function, or the Evaluator will raise an error. If you prefer a different function name, change it consistently throughout the template. A minimal candidate is sketched below.
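For reference, a trivial candidate program for the circle packing setup above might look like this; the (x, y, r) format is the same assumption used in the earlier sketches, and the placement is deliberately naive:
Python
# candidate_program.py -- a minimal, deliberately naive candidate
def main_function():
    # 26 small, non-overlapping circles laid out on a grid inside the unit square
    circles = []
    for i in range(26):
        x = 0.1 + 0.15 * (i % 6)
        y = 0.1 + 0.18 * (i // 6)
        circles.append((x, y, 0.05))
    return circles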
- Evaluation must finish within the system limit (600 seconds by default). A timeout makes the current evaluation invalid and returns an error message.
