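Usage Tips

Last updated: 2026-05-18

Getting Started in Three Steps

The Baidu Famou (百度伐谋) Management Platform is a powerful code evolution engine: given your objective and constraints, it keeps generating better code. To start using Famou, you only need to prepare two core files (the initial code and the evaluator code) plus a prompt.

1) Initial code: the starting point of evolution. You provide a working baseline, and Famou generates variants from it and keeps optimizing.

2) Evaluator: the "referee" of evolution. In every round, the new code Famou generates is run and scored by the evaluator, which tells Famou:

Whether the code is valid (whether it satisfies the constraints)
How good the code is (the higher the score, the better)
If it is invalid, exactly which constraint it violates (used to guide the next round of improvement)

3) Prompt: the "instruction manual" of evolution. You use the prompt to tell Famou how to change the code, which direction to optimize in, and which restrictions must be respected.

Relationship between the initial code, the evaluator, and the prompt

The initial code, the evaluator, and the prompt each come with hard requirements and usage recommendations, which are described in detail below. You can also jump straight to the Templates section and adapt the templates directly.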

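Preparing the initial code (init.py):

Hard requirement: use # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END to mark the code block you want Famou to improve.
Design principle: split the code into two layers.
- Stable layer (must not change): I/O, data parsing, the main entry function, and so on.
- Evolvable layer (may change): the part you want Famou to optimize, written between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END.

Python

# EVOLVE-BLOCK-START
def your_function():
    # The part you want Famou to evolve
    pass

def helper1():
    # You can ask the model to generate helper functions; helpers that may be modified also go inside the EVOLVE-BLOCK
    pass
# EVOLVE-BLOCK-END

def helper2():
    # Helper functions that must not be modified go outside the EVOLVE-BLOCK
    pass

def main():
    # Keep the main entry function outside the EVOLVE-BLOCK
    pass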

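Writing the evaluator (evaluator.py):

◦ Core responsibility: run and score the initial code.

◦ Hard requirement: it must contain an evaluate function, and that function must follow these rules:
- 1. It must take a parameter named path_user_py, and the parameter name must not change.
- 2. It must return a dictionary that contains the following three key fields:

Field	Type	Meaning and requirements
validity	0.0/1.0	Whether the newly generated code is valid. You define the validity conditions: set it to 1 if the new code satisfies the constraints, and to 0 otherwise.
combined_score	float	Famou uses combined_score as its compass and keeps improving the code to raise it. You define how it is computed; a higher combined_score means better code.
error_info	dict	When validity is 0, error_info tells the model exactly which constraint was violated, so that it knows how to improve. When validity=1, error_info must be the empty dictionary {}; when validity=0, you must return your own error details in error_info.

Python

def evaluate(path_user_py: str):  # the function name and the path_user_py parameter name must not change

    # Define the returned dictionary
    metrics = {
        "validity": 0.0,        # required, 0.0/1.0: whether the generated code is valid
        "combined_score": 0.0,  # required, float: quality of the generated code, higher is better
        "feature1": 0.0,        # optional: any metric you care about, name it as you like
        "feature2": 0.0,        # optional: any metric you care about, name it as you like
        "error_info": {}        # required, dict: when the generated code is invalid, explain why here
    }
    return metrics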

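Writing the prompt:

◦ The prompt is the key instruction that guides the model in generating and improving code. A good prompt usually contains the following five parts:

Markdown

1. Role and objective
   - Who I am
   - Which metric I want to optimize

2. Task definition
   - Describe the background of the task in detail

3. Data and field description
   - Meaning of each field in the input data

4. Requirements for the code inside the EVOLVE-BLOCK
   - Only the EVOLVE-BLOCK may be modified
   - Input and output requirements of the generated function
   - Optionally describe the strategy you would like to see, to guide the evolution

5. Constraint validation and evaluation logic
   - For combinatorial optimization problems, describe the constraints in detail, matching the validity checks in the evaluator one by one
   - Explain how each metric in metrics is computed, and describe in detail what combined_score and the other custom fields mean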

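Templates

Evaluator Template

Template Overview

We provide an evaluator template made up of three modules:

1. The run_with_timeout function: runs the code under evaluation and returns results.

2. The evaluate function: calls run_with_timeout and computes the score metrics from the returned results.

3. Custom constraint checking and scoring logic:
- Define a validate(results) function that checks whether the result is feasible (validity: 0.0/1.0).
- Define a score(results) function that computes combined_score from the result.

The template follows; adapt it as indicated by the comments: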

Python

import importlib.util
import numpy as np
import time
import os
import signal
import subprocess
import tempfile
import traceback
import sys
import pickle
import xlrd
import uuid
import json
import pandas as pd
import re
import ast
# TimeoutError is a Python built-in exception, so it needs no import.


#--------------- I. Run the code under evaluation --------------------
def run_with_timeout(program_path: str, function_name: str, args: dict, timeout_seconds: int=2000) -> dict:
    # The function definition needs no changes; copy it as is
    """
    Run the program in a separate process with timeout
    using a simple subprocess approach

    Args:
        program_path: Path to the program file
        function_name: function name (str)
        args: arguments (dict)
        timeout_seconds: Maximum execution time in seconds

    Returns:
        results
    """
    temp_path = f"{function_name}_{uuid.uuid4().hex}.pkl"
    with open(temp_path, "wb") as f:
        pickle.dump(args, f)

    # Create a temporary file to execute
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as temp_file:
        # Write a script that executes the program and saves results
        script = f"""
import sys
import numpy as np
import os
import pickle
import traceback
import inspect

# Add the directory to sys.path
sys.path.insert(0, os.path.dirname('{program_path}'))

# Debugging info
# print(f"Running in subprocess, Python version: {{sys.version}}")
# print(f"Program path: {program_path}")

try:
    # read args
    with open('{temp_path}', 'rb') as f:
        args = pickle.load(f)
    # Import the program
    spec = __import__('importlib.util').util.spec_from_file_location("program", '{program_path}')
    program = __import__('importlib.util').util.module_from_spec(spec)
    spec.loader.exec_module(program)

    # Look up the entry function
    entry = getattr(program, "{function_name}", None)
    if entry is None:
        raise AttributeError("new_born program.py doesn't have {function_name}")

    results = entry(**args)

    with open('{temp_file.name}.results', 'wb') as f:
        pickle.dump(results, f)
    # print(f"Results saved to {temp_file.name}.results")
    
except Exception as e:
    tb = traceback.format_exc()
    # Try to include the source code of the target function as well
    try:
        if entry is not None:
            fn = inspect.unwrap(entry)
            function = inspect.getsource(fn)
        else:
            function = ""
    except Exception:
        function = ""
    with open('{temp_file.name}.results', 'wb') as f:
        pickle.dump({{'error': f'{{e}}', 'traceback': tb, 'function': function}}, f)
    # Key point: exit with code 0 even on failure to prevent the parent process from seeing a non-zero exit code.
    sys.exit(0)
""" 
        temp_file.write(script.encode())
        temp_file_path = temp_file.name

    results_path = f"{temp_file_path}.results"

    try:
        # Run the script with timeout
        process = subprocess.Popen(
            [sys.executable, temp_file_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )

        try:
            stdout, stderr = process.communicate(timeout=timeout_seconds)
            exit_code = process.returncode

            # Prefer the results file when it exists; the exit code is only checked afterwards
            if os.path.exists(results_path):
                with open(results_path, "rb") as f:
                    results = pickle.load(f)
                if "error" in results:
                    # Wrap the child-process error and propagate it upward for the caller to decide whether to continue
                    err = results.get("error", "")
                    tb = results.get("traceback", "")
                    fn = results.get("function", "")
                    raise RuntimeError(f"Program execution failed: {err}\nTraceback:\n{tb}\nFunction:\n{fn}")
                return results

            # If the result file is missing, evaluate the exit code and throw an error if needed, attaching stdout/stderr
            if exit_code != 0:
                raise RuntimeError(
                    f"Process exited with code {exit_code}\n"
                    f"---- STDOUT ----\n{stdout.decode(errors='ignore')}\n"
                    f"---- STDERR ----\n{stderr.decode(errors='ignore')}"
                )
            else:
                raise RuntimeError("Results file not found")

        except subprocess.TimeoutExpired:
            # Kill the process if it times out
            process.kill()
            process.wait()
            raise TimeoutError(f"Process timed out after {timeout_seconds} seconds")

    finally:
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        # Clean up temporary files
        if os.path.exists(temp_file_path):
            os.unlink(temp_file_path)
        if os.path.exists(results_path):
            os.unlink(results_path)


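#--------------- II. Evaluation function --------------------
def evaluate(path_user_py: str):  # the function name and the path_user_py parameter name must not change
    # 1. Define the returned dictionary
    metrics = {
        "validity": 0.0,        # required, 0.0/1.0: whether the generated code is valid
        "combined_score": 0.0,  # required, float: quality of the generated code, higher is better
        "feature1": 0.0,        # optional: any metric you care about, name it as you like
        "feature2": 0.0,        # optional: any metric you care about, name it as you like
        "error_info": {}        # required, dict: when the generated code is invalid, explain why here
    }

    # 2. Define the main entry function name `function` of the evaluated code and the arguments `args` to pass
    function = ""
    args = {
    }

    # 3. Call run_with_timeout to run `function` in the evaluated code and collect the results
    try:
        results = run_with_timeout(
            path_user_py, function, args, timeout_seconds=2400
        )
        # 4. Constraint validation and scoring
        # 4.1 Constraint validation: check all of your constraints here
        valid = validate(results)
        metrics["validity"] = valid
        if valid == 0:
            metrics["error_info"] = {"not valid": "explain here which constraint is violated"}

        # 4.2 If the constraints are satisfied, compute combined_score and the other metrics
        if valid:
            # Write your own task-specific scoring logic here, for example:
            metrics["combined_score"] = score(results)
            metrics["feature1"] = ...  # placeholder: replace with your own metric
            metrics["feature2"] = ...  # placeholder: replace with your own metric

    # Add handlers for any other exceptions you want to capture
    except TimeoutError:
        metrics["error_info"] = {"timeout": "process timeout"}

    except Exception as e:
        metrics["error_info"] = {"exception": str(e), "traceback": traceback.format_exc()}

    return metrics

#--------------- III. Constraint validation and scoring functions --------------------
# Validate results
def validate(results) -> float:
    # Define your own check: return 1.0 if the result is feasible, otherwise 0.0
    raise NotImplementedError

# Compute the score
def score(results) -> float:
    # Define your own scoring: compute combined_score from the result
    raise NotImplementedError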
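
The template leaves validate and score as placeholders. As a minimal sketch, assuming the entry function returns a dict with two hypothetical keys, "weights" and "objective" (neither is defined by the platform), they might look like the following. The helper find_violations is also an assumption, added so that error_info can name each violated constraint instead of a generic message.

```python
import math


def find_violations(results):
    """Return a dict describing every violated constraint (empty if none).

    The keys "weights" and "objective" are hypothetical; use your own result fields.
    """
    if not isinstance(results, dict):
        return {"bad_type": f"expected dict, got {type(results).__name__}"}

    errors = {}
    weights = results.get("weights")
    if not weights:
        errors["missing_weights"] = "results['weights'] is missing or empty"
    else:
        if any(w < 0 for w in weights):
            errors["w_negative"] = "every weight must be >= 0"
        if abs(sum(weights) - 1.0) > 1e-8:
            errors["w_sum_not_one"] = f"weights sum to {sum(weights):.6f}, expected 1.0"

    objective = results.get("objective")
    if not isinstance(objective, (int, float)) or not math.isfinite(objective):
        errors["bad_objective"] = "results['objective'] must be a finite number"
    return errors


def validate(results):
    """Return 1.0 if no constraint is violated, otherwise 0.0."""
    return 0.0 if find_violations(results) else 1.0


def score(results):
    """Return combined_score; here simply the hypothetical objective value."""
    return float(results["objective"])
```

Inside evaluate, the generic message could then be replaced by metrics["error_info"] = find_violations(results) whenever validate returns 0.0.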

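Template Details

2.1 Running the code under evaluation (the `run_with_timeout` function)

Purpose: run the specified function of the evaluated code in a separate subprocess and obtain results. Parameters:

Parameter	Type	Meaning
program_path	str	Path to the code, i.e. the path_user_py parameter of the evaluate function.
function_name	str	Name of the function to call in the evaluated code (usually its main entry function). This function must return a dictionary.
args	dict	Arguments to pass to the called function, given as a dictionary.
timeout_seconds	int	Upper limit on the running time of the evaluated code; must be less than 3600 s.

Outcomes:

Case	Outcome
Normal execution	Returns the return value of the function_name function
Error caught in the subprocess	Raises RuntimeError(...) (includes error, traceback, and the function source)
Timeout	Raises TimeoutError(...)
Results file missing and exit_code != 0	Raises RuntimeError(...)
Results file missing but exit_code == 0	Raises RuntimeError("Results file not found")

How to use it: the function definition needs no changes at all. Just pass the appropriate arguments when calling it from evaluate, as follows: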

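Python

def evaluate(path_user_py: str):

    # 1. Define the returned dictionary

    # The part below is what you need to modify:
    # 2. Define the main entry function name of the evaluated code and the arguments args to pass
    function = "main"
    input_path = "./1_sample_train_with_labels.csv"
    args = {"input_path": input_path}  # example only; replace with your actual arguments

    # 3. Call run_with_timeout to run the function in the evaluated code and collect the results
    try:
        results = run_with_timeout(
            path_user_py, function, args, timeout_seconds=2400
        )
        # ... validation, scoring, and the except clauses follow, as in the full template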

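2.2 Main evaluation function (the `evaluate` function)

Purpose: call run_with_timeout to run the evaluated code and compute its score. The Famou platform calls this function in every round of evolution. It:

Runs path_user_py and obtains results
Computes the metrics from results
Returns metrics

Hard requirements:

The function name evaluate must not change.
It must take a parameter named path_user_py, and the parameter name must not change.
It must return a dictionary containing the following three key fields (field names must not change): validity, combined_score, error_info

Field	Type	Meaning
validity	0.0/1.0	Whether the newly generated code is valid. You define the validity conditions: set it to 1 if the new code satisfies the constraints, and to 0 otherwise.
combined_score	float	Famou uses combined_score as its compass and keeps improving the code to raise it. You define how it is computed; a higher combined_score means better code.
error_info	dict	When validity is 0, error_info tells the model exactly which constraint was violated, so that it knows how to improve. When validity=1, error_info must be the empty dictionary {}; when validity=0, you must return your own error details in error_info.

How to use it: adapt the following skeleton: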

Python

#--------------- II. Evaluation function --------------------
def evaluate(path_user_py: str):  # the function name and the path_user_py parameter name must not change

    # 1. Define the returned dictionary
    metrics = {
        "validity": 0.0,        # required, 0.0/1.0: whether the generated code is valid
        "combined_score": 0.0,  # required, float: quality of the generated code, higher is better
        "feature1": 0.0,        # optional: any metric you care about, name it as you like
        "feature2": 0.0,        # optional: any metric you care about, name it as you like
        "error_info": {}        # required, dict: when the generated code is invalid, explain why here
    }

    # 2. Define the main entry function name `function` of the evaluated code and the arguments `args` to pass
    function = "main"  # example only; replace with your actual entry function
    input_path = "./1_sample_train_with_labels.csv"
    args = {"input_path": input_path}  # args must be a dictionary

    # 3. Call run_with_timeout to run `function` in the evaluated code and collect the results
    try:
        results = run_with_timeout(
            path_user_py, function, args, timeout_seconds=2400
        )
        # 4. Constraint validation and scoring: call the functions from part III to compute metrics
        # 4.1 Constraint validation: check all of your constraints here
        valid = validate(results)
        metrics["validity"] = valid
        if valid == 0:
            metrics["error_info"] = {"not valid": "explain here which constraint is violated"}

        # 4.2 If the constraints are satisfied, compute combined_score and the other metrics
        if valid:
            # Write your own task-specific scoring logic here, for example:
            metrics["combined_score"] = score(results)
            metrics["feature1"] = ...  # placeholder: replace with your own metric
            metrics["feature2"] = ...  # placeholder: replace with your own metric

    # Add handlers for any other exceptions you want to capture
    except TimeoutError:
        metrics["error_info"] = {"timeout": "process timeout"}

    except Exception as e:
        metrics["error_info"] = {"exception": str(e), "traceback": traceback.format_exc()}

    return metrics

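2.3 Custom constraint checking and scoring logic

Define your own validate(results) function to check whether the result is feasible (validity: 0.0/1.0).
Define your own score(results) function to compute combined_score from the result.
Both are called inside evaluate to compute metrics.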

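Python

#--------------- II. Evaluation function --------------------
def evaluate(path_user_py: str):  # the function name and the path_user_py parameter name must not change

    # 1. Define the returned dictionary `metrics` (see 2.2)
    # 2. Define the entry function name `function` and the arguments `args` (see 2.1)
    # 3. Call run_with_timeout to run `function` in the evaluated code and collect the results
    try:
        results = run_with_timeout(
            path_user_py, function, args, timeout_seconds=2400
        )
        # 4. Constraint validation and scoring: call the functions from part III to compute metrics
        # 4.1 Constraint validation: check all of your constraints here
        valid = validate(results)
        metrics["validity"] = valid
        if valid == 0:
            metrics["error_info"] = {"not valid": "explain here which constraint is violated"}

        # 4.2 If the constraints are satisfied, compute combined_score and the other metrics
        if valid:
            # Write your own task-specific scoring logic here, for example:
            metrics["combined_score"] = score(results)
            # metrics["feature1"] = ...
            # metrics["feature2"] = ...

    # Add handlers for any other exceptions you want to capture
    except TimeoutError:
        metrics["error_info"] = {"timeout": "process timeout"}

    except Exception as e:
        metrics["error_info"] = {"exception": str(e), "traceback": traceback.format_exc()}

    return metrics

#--------------- III. Constraint validation and scoring functions --------------------
# Validate results
def validate(results) -> float:
    # Define your own check: return 1.0 if the result is feasible, otherwise 0.0
    raise NotImplementedError

# Compute the score
def score(results) -> float:
    # Define your own scoring: compute combined_score from the result
    raise NotImplementedError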

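Template Summary

1. The run_with_timeout definition needs no changes at all; use it as is. When calling it, specify the name of the function to call in the evaluated code and the corresponding arguments.

2. The function that run_with_timeout calls in the evaluated code must return a dictionary.

3. In evaluate, call run_with_timeout to obtain results, then compute metrics with the validation and scoring functions.

4. Define your own constraint validation and scoring functions in part III.

Checklist

The evaluator must contain an evaluate function, and that function must take a parameter named path_user_py whose name must not change. metrics must contain three fields: validity, combined_score, and error_info.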
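
The checklist can also be applied mechanically before uploading. The following is a minimal sketch of such a local self-check; check_metrics and its messages are hypothetical helpers written for this guide, not part of the platform, and the rule that an invalid result should carry a non-empty error_info follows the field descriptions above.

```python
REQUIRED_FIELDS = ("validity", "combined_score", "error_info")


def check_metrics(metrics):
    """Return a list of problems found in the value returned by evaluate()."""
    if not isinstance(metrics, dict):
        return ["evaluate() must return a dict"]

    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in metrics]
    if problems:
        return problems

    validity = metrics["validity"]
    combined = metrics["combined_score"]
    error_info = metrics["error_info"]

    if validity not in (0.0, 1.0):
        problems.append("validity must be 0.0 or 1.0")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(combined, bool) or not isinstance(combined, (int, float)):
        problems.append("combined_score must be a number")
    if not isinstance(error_info, dict):
        problems.append("error_info must be a dict")
    elif validity == 1.0 and error_info:
        problems.append("error_info must be {} when validity is 1.0")
    elif validity == 0.0 and not error_info:
        problems.append("error_info should explain why validity is 0.0")
    return problems
```

Running check_metrics(evaluate("init.py")) locally and confirming that it returns an empty list is a quick way to catch a missing or mistyped field before the first evolution round.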

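Initial Code Template

Design principle: split the code into two layers.

- Stable layer (must not change): I/O, data parsing, the main entry function, and so on.
- Evolvable layer (may change): the algorithmic core you want the model to optimize, written between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END`.

Specifically, the initial code needs to make two things explicit:

The main entry function:
1. The main entry function is the function the evaluator actually executes.
2. The evaluator calls it directly through run_with_timeout.
3. Its return value must be a dictionary (dict), because run_with_timeout in the evaluator template expects the called function to return a dictionary.
4. It must be placed outside # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END.
The evolvable part:
1. The part that may evolve goes between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END.
2. Relationship to the prompt: the prompt must tell the model how to generate the function and state its input and output requirements.

The following feature engineering example can serve as a template for your own initial code: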

Python

import pandas as pd

# EVOLVE-BLOCK-START
def process_features(df):
    # Function that generates new features
    pass

def helper1():
    # You can ask the model to generate helper functions; helpers that may be modified go inside the EVOLVE-BLOCK
    pass
# EVOLVE-BLOCK-END

def helper2():
    # Helper functions that must not be modified go outside the EVOLVE-BLOCK
    pass

def main(input_path):
    """
    Main entry function: read the data, process the features, and return the processed DataFrame.

    Args:
        input_path: Path to the input CSV file

    Returns:
        dict: Dictionary containing the processing result
            - df_feat: The processed DataFrame
    """
    # Read the data
    df = pd.read_csv(input_path, encoding="utf-8-sig")

    # Process the features
    processed_df = process_features(df)

    # Return the DataFrame and related information
    return {
        "df_feat": processed_df
    }

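Relationship to the evaluator:

In the evaluator's evaluate function, you specify the name of the main entry function of the evaluated code and pass its arguments through args. For example, if you want the evaluator to run the main function of the initial code, set function_name to "main" when calling run_with_timeout and pass the arguments that function needs through args. The evaluator then executes that function in a controlled environment with the name and arguments you specified, and uses the result for evaluation and scoring. Here is the feature engineering example:

Python

def evaluate(path_user_py: str):

    # 1. Define the returned dictionary

    # 2. Define the main entry function name of the evaluated code and the arguments args to pass
    function = "main"
    input_path = "./1_sample_train_with_labels.csv"
    args = {"input_path": input_path}  # args must be a dictionary

    # 3. Call run_with_timeout to run the function in the evaluated code and collect the results
    try:
        results = run_with_timeout(
            path_user_py, function_name=function, args=args, timeout_seconds=2400
        )
        # ... validation, scoring, and the except clauses follow, as in the evaluator template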

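Prompt Template

The prompt is the "instruction manual" of Famou's evolution. In it you tell Famou how to change the code, which direction to optimize in, and which restrictions must be respected, so that Famou improves the code in the direction you expect. A good prompt contains at least the following five parts:

Markdown

1. Role and objective
   - Who I am
   - Which metric I want to optimize

2. Task definition
   - Describe the background of the task in detail

3. Data and field description
   - Meaning of each field in the input data

4. Requirements for the code inside the EVOLVE-BLOCK
   - Only the EVOLVE-BLOCK may be modified
   - Input and output requirements of the generated function
   - Optionally describe the strategy you would like to see, to guide the evolution

5. Constraint validation and evaluation logic
   - For combinatorial optimization problems, describe the constraints in detail, matching the validity checks in the evaluator one by one
   - Explain how each metric in metrics is computed, and describe in detail what combined_score and the other custom fields mean

The following is the prompt from the feature engineering case; you can adapt it for your own prompt: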

Markdown

You are a risk control algorithm expert. You need to incrementally build risk control features using a user log dataset.

---

## Task Description: 

You need to construct new features. Each time, you are required to generate a set of new features (up to 200 features) and add them to the existing dataset. An external evaluation will return a score, along with the ten most relevant and the ten least relevant features.

If adding this new set of features to the existing dataset improves the model's performance, these features will be retained. Otherwise, they will be discarded, and your current task will be considered a failure.

1. The new features should constitute a *single set* of features. This means using specific original columns to generate multiple new features through a *single, specific method* (e.g., the `max()` function can be applied across all columns to generate a number of new features equal to the number of input columns for a given entity like `customerID`). The number of new features must not exceed 200. You must devise a strategy to filter features either before or after their construction to meet this limit.
2. The new features must be a *single set* derived from *one calculation method*, not multiple different sets or methods combined in one attempt.
3. The new features should be different from the existing features, achieved either by using different input columns or by applying a completely different methodology.
4. New features are only retained if they enhance the model's predictive performance. The goal is to mine for effective features as much as possible.
5. The dataset you need to process is massive, approximately 5 million rows by 160 columns. You must improve the operational efficiency of feature construction (e.g., through efficient parallelization or strict input filtering). Simultaneously, due to limited resources, you need to consider how to reduce memory usage (thus, you must strictly define the number of features built each time). Your processing is subject to strict time constraints.

---
## Data Description: 

You have a series of monthly records of users and information on whether they eventually defaulted.
The target binary variable is calculated by observing an 18-month performance window after the latest credit card statement; if the customer does not pay the amount due within 120 days after their latest statement date, it is considered a default event.
The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

customer_ID: the id of the customer
'S_2': record date as a str in the format '%Y-%m-%d'
'target': the required output, a binary feature

* D_*: Delinquency variables
* S_*: Spend variables
* P_*: Payment variables
* B_*: Balance variables
* R_*: Risk variables

### Below is the list of columns in our data:
['customer_ID', 'S_2', 'B_1', 'S_3', 'D_41', 'D_43', 'D_44', 'B_4', 'B_5', 'B_8', 'D_51', 'B_9', 'R_3', 'P_3', 'B_10', 'S_5', 'B_12', 'S_8', 'D_58', 'B_14', 'D_64', 'B_16', 'B_18', 'B_19', 'B_20', 'D_68', 'B_22', 'D_70', 'D_72', 'S_15', 'B_23', 'P_4', 'D_74', 'D_75', 'D_77', 'B_25', 'D_78', 'D_79', 'D_81', 'D_84', 'B_30', 'B_33', 'D_91', 'S_22', 'S_23', 'S_24', 'B_37', 'R_26', 'R_27', 'B_38', 'D_112', 'B_40', 'S_27', 'D_114', 'D_115', 'D_117', 'D_119', 'D_120', 'D_121', 'D_122', 'D_128', 'D_133', 'P_2', 'D_39', 'B_2', 'R_1', 'B_3', 'D_42', 'D_45', 'R_2', 'D_46', 'D_47', 'D_48', 'D_49', 'B_6', 'B_7', 'D_50', 'D_52', 'D_53', 'B_11', 'S_6', 'D_54', 'R_4', 'S_7', 'D_55', 'D_56', 'B_13', 'R_5', 'S_9', 'D_59', 'D_60', 'D_61', 'B_15', 'S_11', 'D_62', 'D_63', 'D_65', 'B_17', 'D_66', 'S_12', 'R_6', 'S_13', 'B_21', 'D_69', 'D_71', 'D_73', 'D_76', 'B_24', 'R_7', 'B_26', 'R_8', 'R_9', 'S_16', 'D_80', 'R_10', 'R_11', 'B_27', 'D_82', 'S_17', 'R_12', 'B_28', 'R_13', 'D_83', 'R_14', 'R_15', 'R_16', 'B_29', 'S_18', 'D_86', 'D_87', 'R_17', 'R_18', 'D_88', 'B_31', 'S_19', 'R_19', 'B_32', 'S_20', 'R_20', 'R_21', 'D_89', 'R_22', 'R_23', 'D_92', 'D_93', 'D_94', 'R_24', 'R_25', 'D_96', 'S_25', 'S_26', 'D_102', 'D_103', 'D_104', 'D_105', 'D_106', 'D_107', 'B_36', 'D_108', 'D_109', 'D_110', 'D_111', 'B_39', 'D_113', 'D_116', 'D_118', 'D_123', 'D_124', 'D_125', 'D_126', 'D_127', 'D_129', 'B_41', 'B_42', 'D_130', 'D_131', 'D_132', 'R_28', 'D_134', 'D_135', 'D_136', 'D_137', 'D_138', 'D_139', 'D_140', 'D_141', 'D_142', 'D_143', 'D_144', 'D_145']


### Among them
cat_features(str) = [
    "B_30",
    "B_38",
    "D_114",
    "D_116",
    "D_117",
    "D_120",
    "D_126",
    "D_63",
    "D_64",
    "D_66",
    "D_68"
]

All cat_features are in str format.
Others are numerical features and you should transform float64 and float32 to float16.

---

## Your ultimate goal is to construct a group of practically useful features.

Based on the existing code (features) and the evaluation score, you should consider which kinds of features should be dropped and which kinds should be added to your code.

---

## EVOLVE-BLOCK Writing Requirements

**You should only modify the content between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END`!**

### What You Need to Define Inside EVOLVE-BLOCK:

Inside the EVOLVE-BLOCK, you must define a function called **`process_features(df)`** that:

1. **Input**: Receives a DataFrame `df` with the following structure:
   - `customer_ID`: Customer identifier
   - `S_2`: Record date (time series data)
   - `target`: Binary target variable
   - Other feature columns (D_*, S_*, P_*, B_*, R_*)

2. **Processing**: Implements your feature engineering logic
   - Aggregate time series data per customer
   - Generate new features using statistical methods
   - Handle missing values and infinite values

3. **Output**: Returns a DataFrame with:
   - **Must include**: `customer_ID` column
   - **Must include**: `target` column
   - **Other columns**: Your engineered features
   - **Each row**: Represents one unique customer (aggregated from time series)

---

### Critical Requirements for process_features():

#### ✅ Must Do:

1. **Return a DataFrame** with `customer_ID` and `target` columns
2. **Each row represents one unique customer** (aggregate from time series)
3. **Keep feature count under 200** (excluding customer_ID and target)
4. **Use a single feature engineering method** (e.g., all aggregations, or all difference features)
5. **Handle missing/infinite values** properly
6. **Optimize memory usage** (convert to float32/int32)
7. **Maintain efficient processing** (use vectorization, parallel processing when possible)

#### ❌ Must NOT Do:
1. ❌ Do NOT drop customer rows (can only drop columns)
2. ❌ Do NOT forget to include `customer_ID` and `target` in output
3. ❌ Do NOT generate more than 200 features
4. ❌ Do NOT use multiple different feature engineering methods in one iteration
109
110### How Your Code Will Be Used:
111
112Your EVOLVE-BLOCK code will be called by a wrapper function like this:
113
114```python
115def function(input_path="./1_sample_train_with_labels.csv"):
116    df = pd.read_csv(input_path, encoding="utf-8-sig")
117    processed_df = process_features(df)  # Calls YOUR function
118
119    return {
120        "df_feat": processed_df
121    }
122
123
124The evaluator will then:
1251. Call your `function()` through `run_with_timeout()`
1262. Extract the returned DataFrame
1273. Train an XGBoost model with 5-fold cross-validation
1284. Return evaluation metrics
129
130---

## Evaluation Metrics Explanation

After your features are generated, the evaluator will train an XGBoost binary classification model and return the following metrics:

### 1. **combined_score (same as AMEX Metric)** (Primary Score)
- **What it is**: A custom business metric designed for credit default prediction
- **Range**: 0.0 to 1.0 (higher is better)
- **How it's calculated**:
  - Combines two equally weighted components: **normalized Gini coefficient** (50%) + **top 4% capture rate** (50%), as in the formula below
  - Weighted by class imbalance (default cases are weighted 20x more than non-default)
  - Focuses on ranking performance at the top of the score distribution
- **Formula**:
  
  AMEX = 0.5 * (normalized_gini + top_4_percent_capture_rate)
  
  - `normalized_gini`: Gini coefficient normalized by random baseline
  - `top_4_percent_capture_rate`: Proportion of actual defaults captured in top 4% predicted

- **Why it matters**: This is the **primary optimization target**. Your features should maximize this score.

### 2. **AUC** (Area Under ROC Curve)

### 3. **F1 Score**

### 4. **Feature Importance** (avg_feature_importance)
- **What it is**: XGBoost feature importance based on gain
- **Format**: Dictionary with two lists
  - `top_20`: Top 20 most important features (by gain)
  - `bottom_20`: Bottom 20 least important features (by gain)
- **How it's calculated**:
  - Average gain across all 5 cross-validation folds
  - Gain = improvement in accuracy brought by a feature to the branches it is on
- **How to use it**:
  - **Keep features in `top_20`**: These are highly predictive
  - **Consider removing features in `bottom_20`**: These contribute little
  - Use this feedback to iterate and improve your feature engineering

### Cross-Validation Process:

The model is trained using **5-fold Stratified Cross-Validation**:
1. Data is split into 5 folds, maintaining class distribution
2. For each fold:
   - Train XGBoost on 4 folds
   - Predict on the held-out fold
   - Record metrics and feature importance
3. Final metrics are averaged across all 5 folds
4. AMEX score is calculated on the full out-of-fold predictions
180### Example Metrics Return:
181
182```python
183{
184    "validity": 1.0,              # 1.0 = success, 0.0 = failed
185    "combined_score": 0.7856,     # Same as AMEX (primary score)
186    "AMEX": 0.7856,               # Business metric (optimize this!)
187    "AUC": 0.8234,                # Standard AUC
188    "F1": 0.6543,                 # F1 at threshold 0.5
189    "num_new_features": 156,      # Number of features you generated
190    "avg_feature_importance": {
191        "top_20": ["B_1_mean", "D_41_std", ...],    # Most important
192        "bottom_20": ["S_8_min", "R_3_max", ...]    # Least important
193    },
194    "error_info": {}              # Empty if successful
195}
196
197
198### What To Optimize:
199
200**Primary Goal**: Maximize **AMEX** score (this is what the beam search tracks)
201
202**Secondary Goals**:
203- Maximize AUC (correlation with AMEX)
204- Generate features that appear in `top_20` importance
205- Avoid generating too many features in `bottom_20`
206
207## Final Reminders:
208
2091. **Only modify EVOLVE-BLOCK** - The function() wrapper is fixed
2102. **Must return DataFrame** with customer_ID and target columns
2113. **Feature limit: 200** (excluding customer_ID and target)
2124. **Single method per iteration** - Don't combine multiple strategies
2135. **Optimize for speed and memory** - Use efficient pandas operations
2146. **Each row = one customer** - Aggregate time series data properly
2157. **Handle missing/inf values** - Don't let them break the pipeline
2168. **Meaningful feature names** - Use format like `{column}_{method}`
217
218You should only modify the content between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END`! However, it must be emphasized that you must generate the complete code, including the unmodified parts outside EVOLVE-BLOCK.

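FAQ

Q: Can evaluator.py import other .py files?

A: Yes. Use sys.path.append(current_dir) to import helper modules from the same directory, as in the following example.

Python

import os
import sys

current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(current_dir)
# evaluator.py imports helper functions from utils.py in the same directory
from utils import (
    CFG,
    load_beam,
    save_beam,
    merge_features,
    add_features,
    calculate_feature_overlap_rate,
    train_and_cv
)

Q: Can init.py import other .py files?

A: Importing other .py files from init.py is not currently supported.

Q: How should init.py or evaluator.py read data files?

A: Always use paths relative to the evaluator.py file. When init.py reads a file, it must also use a path relative to evaluator.py.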
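
One way to follow this rule is to build the path in evaluator.py and hand it to init.py through args, so init.py never has to work out locations itself. The following is a minimal sketch under that assumption; data_path is a hypothetical helper, and it assumes the data file sits in the same directory as evaluator.py.

```python
import os


def data_path(filename, base_file=None):
    """Return the path of `filename` located next to `base_file`.

    `base_file` defaults to the file this function is defined in,
    which is evaluator.py when the helper lives there.
    """
    base_dir = os.path.dirname(os.path.abspath(base_file or __file__))
    return os.path.join(base_dir, filename)


# Hypothetical usage inside evaluate():
# args = {"input_path": data_path("1_sample_train_with_labels.csv")}
```

Because the resulting path is anchored to evaluator.py rather than to the current working directory, it should stay valid regardless of where the process is started from.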

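Q: How do I resolve timeout errors?

A: The system imposes a hard limit of 3600 seconds on the total running time of the initial code. To resolve timeout errors, set the timeout_seconds parameter of run_with_timeout to less than 3600 seconds (for example, the recommended value of 2400 s).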

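Advanced Topics

The key to writing a good evaluator is designing the metrics value returned by evaluate. This section describes the design principles for each field of metrics.

Python

def evaluate(path_user_py: str):  # the function name and the path_user_py parameter name must not change
    # 1. Define the returned dictionary
    metrics = {
        "validity": 0.0,        # required, 0.0/1.0: whether the generated code is valid
        "combined_score": 0.0,  # required, float: quality of the generated code, higher is better
        "feature1": 0.0,        # optional: any metric you care about, name it as you like
        "feature2": 0.0,        # optional: any metric you care about, name it as you like
        "error_info": {}        # required, dict: when the generated code is invalid, explain why here
    }
    return metrics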

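1. Design principles for combined_score: same direction, same scale

The components of combined_score must all point in the positive direction and be on the same scale; do not simply take a weighted sum of the raw metrics. Take quantitative trading as an example.

In quantitative trading, you typically track metrics such as AnnRet, Sharpe, Max DD, and Calmar:
- Annualized return: AnnRet (higher is better)
- Sharpe ratio: Sharpe (higher is better)
- Maximum drawdown: Max DD (lower is better)
- Calmar ratio: annualized return / maximum drawdown (higher is better)
When designing combined_score, consider the following points:
- Same direction: maximum drawdown is a lower-is-better metric and must be converted into a higher-is-better one.
- Same scale: check whether AnnRet, Sharpe, Max DD, and Calmar are on the same scale. In this example they happen to be, so no adjustment is needed.
- Weights: decide which of AnnRet, Sharpe, Max DD, and Calmar the user cares about most. For example, if the user is extremely risk-averse, Max DD, which best captures risk aversion, should receive a higher weight. For instance:

Python

combined_score = (
    0.1 * ann_ret
    + 0.2 * sharpe
    + 0.5 / max_dd
    + 0.2 * calmar
) * validity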
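
The expression above divides by max_dd, so it fails or explodes when the drawdown is zero or very small. The following is a minimal sketch of a guarded version, assuming max_dd is reported as a positive fraction (for example 0.2 for a 20% drawdown); the floor min_dd=0.01 is a hypothetical choice, not a platform requirement.

```python
def combined_score(ann_ret, sharpe, max_dd, calmar, validity, min_dd=0.01):
    """Weighted score in which every term is higher-is-better.

    Max drawdown is lower-is-better, so it enters as 1 / max_dd.
    The drawdown is floored at `min_dd` (hypothetical value) to avoid
    division by zero and to stop a tiny drawdown from dominating the score.
    """
    if validity == 0.0:
        return 0.0
    drawdown_term = 1.0 / max(abs(max_dd), min_dd)
    raw = 0.1 * ann_ret + 0.2 * sharpe + 0.5 * drawdown_term + 0.2 * calmar
    return raw * validity
```

With ann_ret=0.2, sharpe=1.0, max_dd=0.5, and calmar=0.4, the terms are 0.02 + 0.2 + 1.0 + 0.08, giving a score of about 1.3.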

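2. combined_score and validity

Hard constraints should be checked in the validity field
- What is a hard constraint: if violating a constraint makes the solution completely unusable, it is a hard constraint.
- Hard constraints should be the conditions that determine validity; if one is violated, validity should be set to 0.
- A solution with validity=0 should have a combined_score of 0.
  - It is best to compute combined_score as [score] * validity.
Soft constraints should be penalized in the combined_score field
- What is a soft constraint: if violating a constraint lowers the quality of the solution without making it unusable, it is a soft constraint.
- Soft constraints can appear as penalty terms in the combined_score calculation.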
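
The split between hard and soft constraints can be sketched as follows. This is an illustrative example only: the soft constraint (portfolio turnover above a limit), the names turnover_limit and penalty_weight, and their default values are all hypothetical.

```python
def penalized_score(raw_score, turnover, validity,
                    turnover_limit=0.3, penalty_weight=2.0):
    """Apply a soft-constraint penalty, then gate the result by validity.

    Soft constraint (hypothetical): turnover above `turnover_limit` lowers
    the score in proportion to the excess, but the solution stays usable.
    Hard constraints are expressed through `validity` (0.0 or 1.0).
    """
    excess = max(0.0, turnover - turnover_limit)
    return (raw_score - penalty_weight * excess) * validity
```

A solution within the limit keeps its raw score, one that exceeds the limit loses penalty_weight times the excess, and any solution that breaks a hard constraint scores 0 regardless of its raw score.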

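3. validity and error_info

When validity=1, error_info is the empty dictionary {}.
When validity=0, you must return your own error details in error_info:
- This step is very important‼️ The model only knows how to improve if it is told the specific error.
  - If the code raised an error, return the error message, including whatever your exception handling captured.
  - If a constraint was violated, error_info must also state exactly which condition was violated.
Quantitative trading example:
- Suppose the user wants the following hard constraints to hold: 1) no short selling; 2) the investment weights sum to 1.
- When a constraint check fails, the corresponding error message must also be returned in 'error_info', for instance:

Python

# Each violation is stored under its own key, so a second failure does not overwrite the first.
# Constraint 1: negative weights are not allowed (no short selling)
if any(w < 0 for w in weights):
    metrics["validity"] = 0.0
    metrics["error_info"]["w_negative"] = (
        "Portfolio contains negative weights; each w_i must satisfy w_i >= 0 (no short positions allowed)"
    )
# Constraint 2: weights must sum to 1 (fully invested)
if abs(sum(weights) - 1.0) > 1e-8:
    metrics["validity"] = 0.0
    metrics["error_info"]["w_sum_not_one"] = (
        "Sum of portfolio weights must be 1.0 (fully invested constraint); current sum deviates from 1.0 beyond tolerance."
    )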

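4. Custom features: return as many intermediate metrics as possible

Besides the required combined_score, validity, and error_info, you can return the score of each component of combined_score. This gives the model more information for its evolutionary optimization and helps it raise combined_score.

Reason: suppose combined_score rises from 1 to 1.2. Without the intermediate metrics, the model cannot tell which improvement caused the increase, which makes it harder to find the next optimization direction.
In the quantitative trading example, a strategy's combined_score rising from 1 to 1.2 may come from a higher Sharpe or from a higher AnnRet. Only when all intermediate metrics are returned can the model identify the source of the gain and choose where to optimize next.

Python

def evaluate(path_user_py: str):

    # Define the returned dictionary
    metrics = {
        "validity": 0.0,
        "combined_score": 0.0,
        "ann_ret": 0.0,  # return every intermediate metric
        "max_dd": 0.0,
        "calmar": 0.0,
        "error_info": {}
    }
    return metrics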

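Tips for Reading the IDE

Why look at the IDE?

When you use this product, the IDE is your main tool for interacting with the system. Understanding its structure and workflow helps you locate problems, edit code, manage input and output files, and debug more efficiently. It also lets you follow the code evolution process quickly and manage your project better.

IDE Layout

The following example case shows what the directory contains.

Plain Text

/ (root)                                  # Project root directory
  ├── input/                                # Input directory holding the original project files you uploaded
  │   ├── point_case                        # Sample data / test case files
  │   ├── config.yaml                       # Configuration file with evolution parameters, number of islands, etc.
  │   ├── evaluator.py                      # The evaluator you uploaded
  │   └── init_program.py                   # The algorithm to be improved that you uploaded (initial program)
  │
  ├── output/                               # Output directory holding every individual produced during evolution
  │   ├── programs/                         # All individuals generated during evolution
  │   │   ├── 1_0_1_2026051321099.json      # Attribute parameters of the individual: node 1, island 0, generation 1, timestamp
  │   │   ├── 1_0_1_2026051321099.py        # The .py file with the same name is that individual's algorithm code
  │   │   ├── 2_1_1_2026051321101.json      # Node 2, island 1
  │   │   ├── 2_1_1_2026051321101.py
  │   │   ├── 3_2_1_2026051321101.json      # Node 3, island 2
  │   │   ├── 3_2_1_2026051321101.py
  │   │   ├── 4_3_1_2026051321102.json      # Node 4, island 3
  │   │   ├── 4_3_1_2026051321102.py
  │   │   ├── ...
  │   │   └── 100_3_3_2026051408109.py      # Individual from the last generation
  │   ├── init_island_0.json                # Attributes of the initial seed program of island 0
  │   ├── init_island_0.py                  # Code of the initial seed program of island 0
  │   ├── init_island_1.json                # Initial seed program of island 1
  │   ├── init_island_1.py
  │   ├── init_island_2.json                # Initial seed program of island 2
  │   ├── init_island_2.py
  │   ├── init_island_3.json                # Initial seed program of island 3
  │   └── init_island_3.py
  │
  ├── data.json                             # Global metadata recording the overall configuration and statistics of this run
  ├── experiment.log                        # Experiment log (model requests and key events of the evolution process)
  └── report.html                           # The "Result Insights" report in HTML format

input/ is the original project folder you uploaded. It contains the initial algorithm, the evaluator, the data, and the configuration file; the evolution process does not modify its contents.
output/programs/ holds every individual produced during evolution, stored flat in a single folder for easy searching and comparison. Each individual consists of a matching pair of files with the same name: a .json file (attribute parameters) and a .py file (algorithm code). File names follow the pattern {node}_{island}_{generation}_{timestamp}; for example, 5_0_4_2026051321104.json means "node 5, island 0, generation 4, generated at that timestamp".
output/init_island_N.py is the initial seed program of island N. The number of islands is set by num_islands in config.yaml (the example shows 4 islands).
data.json is the global metadata of the whole experiment, including summary information such as the best individual, the scores of each generation, and the state of each island.
experiment.log records key runtime events such as model requests, evolution stage transitions, and the generation and evaluation of individuals. It is the first place to look when tracking down a problem.
report.html is the visual Result Insights report. Open it directly in a browser to view the evolution curve, the code of the best solution, the island distribution, and more.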
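
When comparing many individuals, it can help to parse these file names programmatically. The following is a minimal sketch, assuming the name pattern is exactly {node}_{island}_{generation}_{timestamp} with a .json or .py extension, as in the example listing; Individual and parse_individual are hypothetical names, not part of the platform.

```python
import re
from typing import NamedTuple, Optional


class Individual(NamedTuple):
    node: int
    island: int
    generation: int
    timestamp: str  # kept as a string because its exact format is not specified


# Four underscore-separated digit groups followed by .json or .py
_NAME_PATTERN = re.compile(r"^(\d+)_(\d+)_(\d+)_(\d+)\.(?:json|py)$")


def parse_individual(filename: str) -> Optional[Individual]:
    """Parse an output/programs/ file name; return None if it does not match."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        return None
    node, island, generation, timestamp = match.groups()
    return Individual(int(node), int(island), int(generation), timestamp)
```

Files that do not follow the pattern, such as init_island_0.py, yield None and can simply be skipped when scanning the folder.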

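Quick Start

The Baidu Famou Management Platform is now officially open, and all users can log in and try it directly.

Steps

1. Open a browser and go to workspace.famou.com (by default this redirects to the Famou landing page on Baidu AI Cloud, https://cloud.baidu.com/product/famou.html ), then click "管理平台" (Management Platform).

2. You are taken to the login page of the Baidu Famou Management Platform; log in there with your Baidu AI Cloud account.

3. You then enter the Baidu Famou Management Platform and can begin evolving your algorithms.

Note: Famou has completed its closed beta and is now fully open. All users can use it immediately after logging in, without submitting an application or waiting for allowlist approval. If you run into any problems, please contact us through the official channels.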
