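Evaluator Template

Last updated: 2026-02-06

1. Template Overview

We provide an evaluator template made up of three parts:

run_with_timeout function: runs the code under evaluation and returns its results
evaluate function: calls run_with_timeout and computes the metrics from the returned results
Custom constraint checks and scoring logic:
- Define a validate(results) function that checks whether the results are feasible (validity: 0.0/1.0)
- Define a score(results) function that computes combined_score from the results

The template follows. Adapt it to your task by following the comments.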

Plain Text

import importlib.util
import numpy as np
import time
import os
import signal
import subprocess
import tempfile
import traceback
import sys
import pickle
import xlrd
import uuid
import json
import pandas as pd
import re
import ast


# --------------- Part 1: Run the code under evaluation ---------------
def run_with_timeout(program_path: str, function_name: str, args: dict, timeout_seconds: int = 2000) -> dict:
    # No changes are needed in this function; copy it as-is.
    """
    Run the program in a separate process with a timeout,
    using a simple subprocess approach.

    Args:
        program_path: Path to the program file
        function_name: Name of the function to call (str)
        args: Keyword arguments for that function (dict)
        timeout_seconds: Maximum execution time in seconds

    Returns:
        The value returned by the called function
    """
    # Pass the arguments to the child process through a pickle file
    temp_path = f"{function_name}_{uuid.uuid4().hex}.pkl"
    with open(temp_path, "wb") as f:
        pickle.dump(args, f)

    # Create a temporary file to execute
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as temp_file:
        temp_file_path = temp_file.name
        results_path = f"{temp_file_path}.results"
        # Write a script that executes the program and saves results.
        # Paths are embedded with !r so that backslashes and quotes are escaped.
        script = f"""
import sys
import numpy as np
import os
import pickle
import traceback
import inspect

# Add the directory to sys.path
sys.path.insert(0, os.path.dirname({program_path!r}))

# Debugging info
# print(f"Running in subprocess, Python version: {{sys.version}}")
# print(f"Program path: {program_path}")

entry = None
try:
    # Read args
    with open({temp_path!r}, 'rb') as f:
        args = pickle.load(f)
    # Import the program
    spec = __import__('importlib.util').util.spec_from_file_location("program", {program_path!r})
    program = __import__('importlib.util').util.module_from_spec(spec)
    spec.loader.exec_module(program)

    # Look up the entry function
    entry = getattr(program, {function_name!r}, None)
    if entry is None:
        raise AttributeError("new_born program.py doesn't have {function_name}")

    results = entry(**args)

    with open({results_path!r}, 'wb') as f:
        pickle.dump(results, f)
    # print("Results saved")

except Exception as e:
    tb = traceback.format_exc()
    # Try to include the source code of the target function as well
    try:
        if entry is not None:
            fn = inspect.unwrap(entry)
            function = inspect.getsource(fn)
        else:
            function = ""
    except Exception:
        function = ""
    with open({results_path!r}, 'wb') as f:
        pickle.dump({{'error': f'{{e}}', 'traceback': tb, 'function': function}}, f)
    # Key point: exit with code 0 even on failure, so the parent reads the
    # results file instead of seeing a non-zero exit code.
    sys.exit(0)
"""
        temp_file.write(script.encode())

    try:
        # Run the script with timeout
        process = subprocess.Popen(
            [sys.executable, temp_file_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )

        try:
            stdout, stderr = process.communicate(timeout=timeout_seconds)
            exit_code = process.returncode

            # If the child wrote a results file, use it regardless of the exit code
            if os.path.exists(results_path):
                with open(results_path, "rb") as f:
                    results = pickle.load(f)
                if isinstance(results, dict) and "error" in results:
                    # Wrap the child-process error and propagate it upward so the caller can decide whether to continue
                    err = results.get("error", "")
                    tb = results.get("traceback", "")
                    fn = results.get("function", "")
                    raise RuntimeError(f"Program execution failed: {err}\nTraceback:\n{tb}\nFunction:\n{fn}")
                return results

            # The results file is missing: raise an error based on the exit code, attaching stdout/stderr
            if exit_code != 0:
                raise RuntimeError(
                    f"Process exited with code {exit_code}\n"
                    f"---- STDOUT ----\n{stdout.decode(errors='ignore')}\n"
                    f"---- STDERR ----\n{stderr.decode(errors='ignore')}"
                )
            else:
                raise RuntimeError("Results file not found")

        except subprocess.TimeoutExpired:
            # Kill the process if it times out
            process.kill()
            process.wait()
            raise TimeoutError(f"Process timed out after {timeout_seconds} seconds")

    finally:
        # Clean up temporary files
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        if os.path.exists(temp_file_path):
            os.unlink(temp_file_path)
        if os.path.exists(results_path):
            os.unlink(results_path)


# --------------- Part 2: Evaluation function ---------------
def evaluate(path_user_py: str):  # the function name and the parameter name path_user_py must not be changed
    # 1. Define the returned dict
    metrics = {
        "validity": 0.0,  # required, 0/1: whether the generated code is valid
        "combined_score": 0.0,  # required, float: quality of the generated code; higher is better
        "feature1": 0.0,  # optional: any extra metric you care about; name it as you like
        "feature2": 0.0,  # optional: any extra metric you care about; name it as you like
        "error_info": {},  # required, dict: when the generated code is invalid, explain why here
    }

    # 2. Define the main entry function name (function) of the code under evaluation and the arguments args
    function = "main"  # change to the actual function name
    input_path = "./1_sample_train_with_labels.csv"
    args = {"input_path": input_path}  # dict of arguments to pass

    # 3. Call run_with_timeout to run that function and get its results
    try:
        results = run_with_timeout(
            path_user_py, function, args, timeout_seconds=2400
        )
        # 4. Constraint validation and scoring: call the Part 3 functions to compute metrics
        # 4.1 Constraint validation: check all of your constraints here
        valid = validate(results)
        metrics["validity"] = valid
        if valid == 0:
            metrics["error_info"] = {"not valid": "explain here why the constraints are not met"}

        # 4.2 If the constraints are met, compute combined_score and the other metrics
        if valid:
            # Write the scoring logic for your task here, for example:
            metrics["combined_score"] = score(results)
            # metrics["feature1"] = ...
            # metrics["feature2"] = ...

    # Add handlers here for any other exceptions you want to capture
    except TimeoutError:
        metrics["error_info"] = {"timeout": "process timeout"}

    except Exception as e:
        metrics["error_info"] = {"exception": str(e), "traceback": traceback.format_exc()}

    return metrics

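# --------------- Part 3: Constraint validation and scoring functions ---------------
# Validate the results:
# define validate(results) here; it checks whether the results are feasible
# and returns the validity (0.0 or 1.0).
# Compute the score:
# define score(results) here; it computes combined_score from the results.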

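2. Details

2.1 Running the code under evaluation (run_with_timeout function)

Purpose: runs the specified function of the code under evaluation in a separate child process and obtains its results.
Parameters:

Parameter	Type	Meaning
program_path	str	Path to the code file; pass the evaluate function's path_user_py argument
function_name	str	Name of the function to call in the code under evaluation (usually its main entry function); this function must return a dict
args	dict	Arguments for the called function, passed to it as keyword arguments
timeout_seconds	int	Time limit for running the code under evaluation; must be less than 3600 s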
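
For illustration, here is a minimal sketch of what a program under evaluation might look like. The function name main and the parameter input_path match the template's example call; the returned keys n_rows and predictions are hypothetical placeholders for whatever your task actually produces.

```python
import csv


def main(input_path: str) -> dict:
    """Hypothetical entry point of a program under evaluation."""
    # Count the data rows in the CSV file (the header line is skipped).
    with open(input_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        n_rows = sum(1 for _ in reader)

    # Return a plain dict: the evaluator requires a dict, and the value is
    # sent back to the parent process through pickle.
    return {"n_rows": n_rows, "predictions": [0] * n_rows}
```

With this program, the evaluator would set function = "main" and args = {"input_path": ...}, so that the call made in the child process is main(input_path=...).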

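Outcomes:

Case	Outcome
Normal execution	Returns the return value of the function_name function
Error caught in the child process	Raises RuntimeError(...) (includes the error message, traceback, and function source)
Timeout	Raises TimeoutError(...)
Results file missing and exit_code != 0	Raises RuntimeError(...) (includes stdout and stderr)
Results file missing but exit_code == 0	Raises RuntimeError("Results file not found")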
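
The outcomes above mean that the caller has to separate a timeout from other failures. The sketch below shows one way to turn each outcome into an error_info dict. The helper call_and_capture and the key "runtime_error" are hypothetical names used only for illustration; in the template this logic lives directly in the except clauses of evaluate.

```python
import traceback


def call_and_capture(fn, *args, **kwargs):
    """Call fn and return (results, error_info) instead of raising."""
    try:
        return fn(*args, **kwargs), {}
    except TimeoutError:
        # Raised by run_with_timeout when the time limit is exceeded
        return None, {"timeout": "process timeout"}
    except RuntimeError as e:
        # Raised for child-process errors and for a missing results file
        return None, {"runtime_error": str(e)}
    except Exception as e:
        # Anything else, for example a bug in the evaluator itself
        return None, {"exception": str(e), "traceback": traceback.format_exc()}
```

TimeoutError is handled first on purpose. It is a subclass of OSError rather than RuntimeError, so the order is not strictly required here, but putting the most specific case first keeps the intent clear.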

Usage: the function definition needs no changes. Just pass the appropriate arguments when calling it from evaluate, as follows:

Plain Text

# --------------- Part 1: Run the code under evaluation ---------------
def run_with_timeout(program_path: str, function_name: str, args: dict, timeout_seconds: int = 2000) -> dict:
    # This part needs no changes.
    # The body is omitted here for brevity; see the full definition in the template above.
    ...

def evaluate(path_user_py: str):

    # 1. Define the returned dict (omitted here)

    # THE PART YOU NEED TO MODIFY:
    # 2. Define the main entry function name of the code under evaluation and the arguments args
    function = "main"  # change to the actual function name
    input_path = "./1_sample_train_with_labels.csv"
    args = {"input_path": input_path}  # dict of arguments to pass

    # 3. Call run_with_timeout to run that function and get its results
    try:
        results = run_with_timeout(
            path_user_py, function, args, timeout_seconds=2400
        )
    # ... step 4 and the except clauses continue as in the template above
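
Because run_with_timeout writes args to a file with pickle.dump and reads the results back with pickle.load, both the arguments and the return value have to be picklable. The helper below is a small sketch for checking this before an evaluation run; is_picklable is a hypothetical name and is not part of the template.

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        # pickle raises different exception types depending on the object
        return False


# Plain data such as paths, numbers, lists and dicts is fine.
print(is_picklable({"input_path": "./1_sample_train_with_labels.csv"}))
# Lambdas (and objects such as open file handles) are not.
print(is_picklable({"callback": lambda x: x}))
```

In practice this suggests passing file paths and plain values in args and letting the called function open files itself.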

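2.2 Main evaluation function (evaluate function)

Purpose: calls run_with_timeout to run the code under evaluation and computes its score. The FAMOU (伐谋) platform calls this function in every round of evolution. It:
- Runs path_user_py and obtains results
- Computes the metrics from results
- Returns metrics
Hard requirements:
- The function name evaluate must not be changed
- It must take the parameter path_user_py, and the parameter name must not be changed
- It must return a dict containing the following three key fields (field names must not be changed): validity, combined_score, error_info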

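Field	Type	Meaning and requirements
validity	0.0/1.0	Purpose: indicates whether the newly generated code is valid. Requirement: define your own validity conditions; set it to 1 if the code satisfies the constraints and to 0 otherwise
combined_score	float	Purpose: FAMOU uses combined_score as its compass and keeps improving the code to raise it. Requirement: define your own calculation; a higher combined_score means better code
error_info	dict	Purpose: when validity is 0, error_info tells the model exactly why the constraints are not met, so that it knows in which direction to improve. Requirement: when validity = 1, set error_info to the empty dict {}; when validity = 0, return your own explanatory error_info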
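
The field requirements in the table can be checked mechanically before an evaluator is used. The following is a minimal sketch; check_metrics and REQUIRED_FIELDS are hypothetical names, and the platform's own checks may differ.

```python
REQUIRED_FIELDS = ("validity", "combined_score", "error_info")


def check_metrics(metrics) -> list:
    """Return a list of problems; an empty list means the requirements are met."""
    if not isinstance(metrics, dict):
        return ["metrics must be a dict"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in metrics]

    if "validity" in metrics and metrics["validity"] not in (0, 1):
        problems.append("validity must be 0 or 1")

    score = metrics.get("combined_score")
    if "combined_score" in metrics and (
        isinstance(score, bool) or not isinstance(score, (int, float))
    ):
        problems.append("combined_score must be a number")

    if "error_info" in metrics:
        info = metrics["error_info"]
        if not isinstance(info, dict):
            problems.append("error_info must be a dict")
        elif metrics.get("validity") == 1 and info:
            problems.append("error_info must be {} when validity is 1")
        elif metrics.get("validity") == 0 and not info:
            problems.append("error_info should explain why validity is 0")

    return problems
```

For example, check_metrics({"validity": 1.0, "combined_score": 0.8, "error_info": {}}) returns an empty list, while a dict with validity 0 and an empty error_info is reported as a problem.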

Usage: adapt the following framework to your task:

Plain Text

# --------------- Part 2: Evaluation function ---------------
def evaluate(path_user_py: str):  # the function name and the parameter name path_user_py must not be changed

    # 1. Define the returned dict
    metrics = {
        "validity": 0.0,  # required, 0/1: whether the generated code is valid
        "combined_score": 0.0,  # required, float: quality of the generated code; higher is better
        "feature1": 0.0,  # optional: any extra metric you care about; name it as you like
        "feature2": 0.0,  # optional: any extra metric you care about; name it as you like
        "error_info": {},  # required, dict: when the generated code is invalid, explain why here
    }

    # 2. Define the main entry function name (function) of the code under evaluation and the arguments args
    function = "main"  # an example; replace with the actual function name
    input_path = "./1_sample_train_with_labels.csv"
    args = {"input_path": input_path}  # args must be a dict

    # 3. Call run_with_timeout to run that function and get its results
    try:
        results = run_with_timeout(
            path_user_py, function, args, timeout_seconds=2400
        )
        # 4. Constraint validation and scoring: call the Part 3 functions to compute metrics
        # 4.1 Constraint validation: check all of your constraints here
        valid = validate(results)
        metrics["validity"] = valid
        if valid == 0:
            metrics["error_info"] = {"not valid": "explain here why the constraints are not met"}

        # 4.2 If the constraints are met, compute combined_score and the other metrics
        if valid:
            # Write the scoring logic for your task here, for example:
            metrics["combined_score"] = score(results)
            # metrics["feature1"] = ...
            # metrics["feature2"] = ...

    # Add handlers here for any other exceptions you want to capture
    except TimeoutError:
        metrics["error_info"] = {"timeout": "process timeout"}

    except Exception as e:
        metrics["error_info"] = {"exception": str(e), "traceback": traceback.format_exc()}

    return metrics

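2.3 Custom constraint checks and scoring logic

Define your own validate(results) function to check whether the results are feasible (validity: 0.0/1.0).
Define your own score(results) function to compute combined_score from the results.
Both are called inside evaluate to compute metrics.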

Plain Text

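# --------------- Part 2: Evaluation function ---------------
def evaluate(path_user_py: str):  # the function name and the parameter name path_user_py must not be changed

    # 1. Define the returned dict
    # 2. Define the main entry function name (function) of the code under evaluation and the arguments args
    # 3. Call run_with_timeout to run that function and get its results
    #    (steps 1-3 and the enclosing try statement are omitted here)

        # THE PART THIS SECTION IS ABOUT:
        # 4. Constraint validation and scoring: call the Part 3 functions to compute metrics
        # 4.1 Constraint validation: check all of your constraints here
        valid = validate(results)
        metrics["validity"] = valid
        if valid == 0:
            metrics["error_info"] = {"not valid": "explain here why the constraints are not met"}

        # 4.2 If the constraints are met, compute combined_score and the other metrics
        if valid:
            # Write the scoring logic for your task here, for example:
            metrics["combined_score"] = score(results)
            # metrics["feature1"] = ...
            # metrics["feature2"] = ...


# --------------- Part 3: Constraint validation and scoring functions ---------------
# Validate the results:
# define validate(results) here; it checks whether the results are feasible
# and returns the validity (0.0 or 1.0).
# Compute the score:
# define score(results) here; it computes combined_score from the results.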
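
As a concrete illustration of Part 3, the sketch below implements validate and score for a made-up binary classification task. The result keys "predictions" and "labels", the 0/1 constraint, and the use of accuracy as combined_score are all assumptions chosen for the example; replace them with the constraints and scoring rule of your own task.

```python
def validate(results) -> float:
    """Return 1.0 if the results satisfy every constraint, else 0.0."""
    if not isinstance(results, dict):
        return 0.0
    preds = results.get("predictions")
    labels = results.get("labels")
    # Both lists must be present, non-empty, and of equal length.
    if not isinstance(preds, list) or not isinstance(labels, list):
        return 0.0
    if len(preds) == 0 or len(preds) != len(labels):
        return 0.0
    # Every prediction must be a valid class label (0 or 1).
    if any(p not in (0, 1) for p in preds):
        return 0.0
    return 1.0


def score(results: dict) -> float:
    """Accuracy in [0, 1]; call only after validate(results) returned 1.0."""
    preds, labels = results["predictions"], results["labels"]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)
```

Keeping validate free of scoring logic means that score can assume well-formed input, which matches the order of steps 4.1 and 4.2 in evaluate. If you want error_info to name the failed constraint, one option is to have a helper return the reason alongside the 0.0/1.0 flag.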

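3. Usage Summary

The run_with_timeout definition needs no changes; use it as-is.
When calling run_with_timeout from evaluate, specify which function of the code under evaluation to call and pass its arguments; then take the returned results and compute metrics with the validation and scoring functions.
The function that run_with_timeout calls in the code under evaluation must return a dict.
Define your own constraint validation and scoring functions in Part 3.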

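4. Checklist

The evaluator must contain an evaluate function that takes the parameter path_user_py; the parameter name must not be changed.
metrics must contain the three fields validity, combined_score, and error_info.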
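
The first checklist item can be verified with the standard inspect module before submitting an evaluator. This is a sketch only; check_evaluate_signature is a hypothetical helper, and the platform may apply its own checks.

```python
import inspect


def check_evaluate_signature(namespace: dict) -> list:
    """Check that namespace defines evaluate(path_user_py); return a list of problems."""
    fn = namespace.get("evaluate")
    if not callable(fn):
        return ["evaluate function is missing"]
    if "path_user_py" not in inspect.signature(fn).parameters:
        return ["evaluate must take a parameter named path_user_py"]
    return []
```

Inside the evaluator file, check_evaluate_signature(globals()) runs the check on the current module; for an imported module, pass vars(module) instead.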
