Model Hyperparameter Configuration
1. Model Configuration (model_conf)
1. model: the model type to use. Here built-in|KVFM selects a built-in key-value Factorization Machine (KVFM). Note that this name is framework-specific: the standard FM family includes FM (Factorization Machines) and FFM (Field-aware Factorization Machines), while KVFM denotes an FM implementation backed by key-value embedding tables.
2. table_conf: parameters of the embedding tables, commonly used for categorical features. Each table has a name, a dimension, a list of zero indexes, a shard count, and an initializer.
- name: the embedding table's name.
- dim: the dimension of each embedding vector.
- zero_indexes: indexes whose embeddings are fixed to the zero vector, typically used for missing values or padding placeholders.
- shard_num: the number of shards the table is split into, used to distribute the table's storage and lookups in distributed training.
- initializer: how embedding vectors are initialized; here uniform random initialization with a specified min and max.
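As an illustration of how these fields typically behave (a hypothetical sketch, not the framework's actual code): the table is filled uniformly at random within [min, max], rows listed in zero_indexes are pinned to the zero vector, and a feature id is routed to one of shard_num shards, commonly by a modulo or hash rule.

```python
import random

def build_table(rows, dim, zero_indexes, low=-0.01, high=0.01, seed=0):
    """Uniform-random embedding table; rows listed in zero_indexes stay all-zero."""
    rng = random.Random(seed)
    table = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(rows)]
    for idx in zero_indexes:
        table[idx] = [0.0] * dim
    return table

def shard_of(feature_id, shard_num=20):
    """One common sharding rule: route a feature id to a shard by modulo."""
    return feature_id % shard_num

table = build_table(rows=5, dim=4, zero_indexes=[0])
print(table[0])         # the zero vector reserved for padding/missing values
print(shard_of(12345))  # 5
```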
2. Optimizer Configuration (optimizer_conf)
1. optimizer: the optimizer type; Adam is used here.
2. lr: the learning rate, set to 1e-4.
3. l2_decay: the L2 regularization coefficient, used to curb overfitting. This value should normally be well below 1; the sample configs below use 0.001, and it may still need tuning.
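For intuition on what l2_decay does (a generic sketch of coupled L2 regularization, not necessarily this framework's exact update rule): the penalty 0.5 * l2_decay * w² contributes l2_decay * w to each parameter's gradient before the Adam step, gently pulling weights toward zero.

```python
def l2_regularized_grad(grad, weight, l2_decay=0.001):
    """Coupled L2: the penalty 0.5 * l2_decay * w**2 adds l2_decay * w to the gradient."""
    return grad + l2_decay * weight

print(l2_regularized_grad(0.5, 2.0))  # ~0.502: a small extra pull toward zero
```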
3. Run Configuration (run_conf)
1. runner: the run mode; train_eval_save means the model is trained, evaluated, and saved.
2. epoch: the number of training epochs.
3. train_print_period: the interval (in samples) at which logs are printed during training.
4. train_debug: whether debug mode is enabled during training.
5. valid_print_period: the interval (in samples) at which logs are printed during validation.
6. valid_debug: whether debug mode is enabled during validation.
7. save_dir: the directory where models are saved.
8. save_mode: the model save mode; save_pipeline_model presumably saves the whole pipeline, i.e. preprocessing together with the trained model.
9. logdir: the directory where log files are written.
4. Data Configuration (data_conf)
1. train_dataset and val_dataset: parameters of the training and validation datasets, respectively.
- mode: the dataset processing mode; InMemoryDataset means the data is loaded into memory.
- shuffle: whether to shuffle the data before training.
- thread: the number of threads used to load data.
- batch_size: the number of samples per batch.
- dirs: the directories containing the data. Note that here the training set and validation set point to the same directory (./valid_dataset), which is usually not best practice: the validation set should be held out, independent of the training data.
2. json_template: the path to the data template file (sample_data.json), which presumably defines the data's format and structure.
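Given batch_size and a dataset size, the number of batches in one pass over the data is simple arithmetic and can serve as a sanity check on a run (the sizes below are taken from the training log later in this document):

```python
import math

def steps_per_epoch(num_samples, batch_size, drop_last=False):
    """How many batches one full pass over the dataset produces."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

# Dataset sizes from the training log: 53,466,444 train / 50,000 valid samples
print(steps_per_epoch(53_466_444, 128))  # 417707
print(steps_per_epoch(50_000, 128))      # 391
```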
5. Notes
1. Some parameters in the config files (such as the l2_decay value) may need adjustment for your use case.
2. Using the same directory for the training and validation sets is not best practice; make sure the validation set is held out from the training data.
3. Embedding dimensions and shard counts should be chosen according to problem complexity and available compute.
4. If you use a custom model (such as lr.py|MyLR), make sure the corresponding Python file (lr.py) and class (MyLR) are correctly implemented and importable.
6. Model Parameters
- LR
fe_conf: slot_list.conf
model_conf:
  model: built-in|KVLR
  table_conf:
  - dim: 1
    initializer:
      attrs:
        max: 0.01
        min: -0.01
      type: uniform_random
    name: embedding
    shard_num: 20
    zero_indexes:
    - 0
optimizer_conf:
  optimizer: adam
  lr: 1e-4
  l2_decay: 0.001

run_conf:
  runner: train_eval_save
  epoch: 1
  train_print_period: 100
  train_debug: false
  valid_print_period: 4
  valid_debug: false
  # load_dir: save_dir/epoch_0
  save_dir: save_dir
  save_mode: save_pipeline_model
  logdir: logs
  logfile: train.log

data_conf:
  train_dataset:
    mode: InMemoryDataset
    shuffle: True
    thread: 8
    batch_size: 128
    dirs:
    - /home/work/data/
  val_dataset:
    mode: InMemoryDataset
    shuffle: False
    thread: 8
    batch_size: 128
    dirs:
    - /home/work/data/
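With dim: 1, each feature's embedding in KVLR is effectively a single scalar weight, so scoring reduces to a sigmoid over the sum of looked-up weights plus a bias. A minimal sketch of that idea (the table contents and feature ids are made up; this is not the framework's code):

```python
import math

def lr_score(weight_table, feature_ids, bias=0.0):
    """LR over sparse features: sum the dim-1 embeddings, then apply a sigmoid."""
    z = bias + sum(weight_table.get(f, 0.0) for f in feature_ids)
    return 1.0 / (1.0 + math.exp(-z))

weights = {101: 0.8, 202: -0.3, 303: 0.1}   # feature id -> scalar weight (made up)
print(lr_score(weights, [101, 202]))        # sigmoid(0.5) ~= 0.622
```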
- MMOE
fe_conf: slot_list.conf
model_conf:
  model: built-in|KVMMOE
  table_conf:
  - name: tag_embedding
    dim: 16
    shard_num: 20
    zero_indexes: [0]
    initializer:
      type: uniform_random
      attrs:
        min: -0.01
        max: 0.01
  loss_weights: [1]
  task_modes: ["classification"]
  tower_sizes: [8]

optimizer_conf:
  optimizer: adam
  lr: 1e-4

run_conf:
  runner: train_eval_save
  epoch: 10
  train_print_period: 100
  train_debug: False
  valid_print_period: 100
  valid_debug: False
  load_dir: ""
  save_dir: "save_dir"

data_conf:
  json_template: json_template
  train_dataset:
    mode: InMemoryDataset
    shuffle: True
    thread: 1
    batch_size: 128
    dirs:
    - train-data
  val_dataset:
    mode: InMemoryDataset
    shuffle: False
    thread: 1
    batch_size: 128
    dirs:
    - train-data
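The MMOE-specific fields map onto the usual MMoE structure: shared experts, a softmax gate per task that mixes the expert outputs, a small tower per task (tower_sizes), and per-task losses combined with loss_weights. A single-sample sketch of the gating step, with made-up expert outputs and gate logits:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def mmoe_forward(expert_outputs, gate_logits_per_task):
    """Each task mixes the shared expert outputs with its own softmax gate;
    the mixed vector then feeds that task's tower (not shown)."""
    dim = len(expert_outputs[0])
    task_inputs = []
    for logits in gate_logits_per_task:
        g = softmax(logits)
        mixed = [sum(g[e] * expert_outputs[e][d] for e in range(len(expert_outputs)))
                 for d in range(dim)]
        task_inputs.append(mixed)
    return task_inputs

experts = [[1.0, 0.0], [0.0, 1.0]]   # two experts, 2-dim outputs (made up)
gates = [[2.0, 0.0], [0.0, 2.0]]     # one row of gate logits per task
print(mmoe_forward(experts, gates))  # each task leans toward "its" expert
```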
- FM
fe_conf: slot_list.conf

model_conf:
  model: built-in|KVFM
  table_conf:  # remove table_conf if you don't use kv-table.
  - name: first_embedding
    dim: 1
    shard_num: 20
    zero_indexes: [0]
    initializer:
      type: uniform_random
      attrs:
        min: -0.01
        max: 0.01
  - name: second_embedding
    dim: 16
    shard_num: 20
    zero_indexes: [0]
    initializer:
      type: uniform_random
      attrs:
        min: -0.01
        max: 0.01

optimizer_conf:
  optimizer: adam
  lr: 1e-4

run_conf:
  runner: train_eval_save
  epoch: 10
  train_print_period: 100
  train_debug: False
  valid_print_period: 100
  valid_debug: False
  load_dir: "save_dir"
  save_dir: "save_dir"

data_conf:
  json_template: json_template
  train_dataset:
    mode: InMemoryDataset
    shuffle: True
    thread: 1
    batch_size: 128
    dirs:
    - /home/work/data/
  val_dataset:
    mode: InMemoryDataset
    shuffle: False
    thread: 1
    batch_size: 128
    dirs:
    - /home/work/data/
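The two tables mirror FM's two terms: first_embedding (dim 1) holds the per-feature linear weights, and second_embedding (dim 16) holds the factor vectors for pairwise interactions. The pairwise term is usually computed in O(n·k) via the identity sum over i&lt;j of &lt;v_i, v_j&gt; = 0.5 * (||sum_i v_i||² - sum_i ||v_i||²). A small sketch, with a naive O(n²) reference for comparison:

```python
def fm_pairwise(vectors):
    """O(n*k) FM trick: 0.5 * (||sum v||^2 - sum ||v||^2)."""
    dim = len(vectors[0])
    sum_vec = [sum(v[d] for v in vectors) for d in range(dim)]
    sum_sq = sum(v[d] ** 2 for v in vectors for d in range(dim))
    return 0.5 * (sum(s * s for s in sum_vec) - sum_sq)

def fm_pairwise_naive(vectors):
    """Reference O(n^2) version: sum of dot products over all pairs i < j."""
    total = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += sum(a * b for a, b in zip(vectors[i], vectors[j]))
    return total

vs = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]
print(fm_pairwise(vs), fm_pairwise_naive(vs))  # both give the same value
```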
- DssmRank
model_conf:
  model: dssm_rank|KvDssmRank
  left_hidden_size: [10]
  right_hidden_size: [10]
  table_conf:
  - dim: 64
    initializer:
      attrs:
        max: 0.01
        min: -0.01
      type: uniform_random
    name: embedding
    shard_num: 20
    zero_indexes:
    - 0

optimizer_conf:
  optimizer: adam
  lr: 1e-4
  l2_decay: 0.001

run_conf:
  runner: train_eval_save
  epoch: 1
  train_print_period: 100
  train_debug: false
  valid_print_period: 4
  valid_debug: false
  load_dir: save_dir/epoch_0
  save_dir: save_dir
  save_mode: save_pipeline_model
  logdir: logs
  logfile: train.log

data_conf:
  train_dataset:
    mode: InMemoryDataset
    shuffle: True
    thread: 8
    batch_size: 128
    dirs:
    - train-data
  val_dataset:
    mode: InMemoryDataset
    shuffle: False
    thread: 1
    batch_size: 128
    dirs:
    - train-data
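left_hidden_size and right_hidden_size define the two towers of the DSSM-style ranker; each side is reduced to a small vector and the score is a similarity between the two. A sketch of the scoring step, assuming cosine similarity (the framework's actual scoring head may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between the two tower outputs."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

left = [0.1] * 10    # e.g. the query-side tower output (made up)
right = [0.1] * 10   # e.g. the item-side tower output (made up)
print(cosine(left, right))  # identical directions score ~1.0
```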
- DeepFM
fe_conf: slot_list.conf

model_conf:
  model: built-in|KVDeepFM
  table_conf:  # remove table_conf if you don't use kv-table.
  - name: embedding_1
    dim: 100
    zero_indexes: [0]
    initializer:
      type: uniform_random
      attrs:
        min: -0.01
        max: 0.01
  - name: embedding_2
    dim: 100
    zero_indexes: [0]
    initializer:
      type: uniform_random
      attrs:
        min: -0.01
        max: 0.01
  hidden_size: [256, 128, 64]

optimizer_conf:
  optimizer: adam
  lr: 1e-4

run_conf:
  runner: train_eval_save
  epoch: 10
  train_print_period: 100
  train_debug: False
  valid_print_period: 100
  valid_debug: False
  load_dir: "save_dir"
  save_dir: "save_dir"

data_conf:
  json_template: json_template
  train_dataset:
    mode: InMemoryDataset
    shuffle: True
    thread: 1
    batch_size: 128
    dirs:
    - /home/work/data/
  val_dataset:
    mode: InMemoryDataset
    shuffle: False
    thread: 1
    batch_size: 128
    dirs:
    - /home/work/data/
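hidden_size: [256, 128, 64] specifies the widths of the deep part's MLP, which consumes the concatenated feature embeddings. A quick way to reason about it is to track the layer shapes and parameter count (the input width below assumes a made-up field count of 26 at dim 100):

```python
def mlp_shapes(input_dim, hidden_size):
    """Per-layer (in, out) shapes and the total parameter count (weights + biases)."""
    shapes, params, d = [], 0, input_dim
    for h in hidden_size:
        shapes.append((d, h))
        params += d * h + h
        d = h
    return shapes, params

# Input width: 26 fields (assumed) * dim 100 = 2600
shapes, params = mlp_shapes(2600, [256, 128, 64])
print(shapes)  # [(2600, 256), (256, 128), (128, 64)]
print(params)  # 707008
```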
- NRMS
fe_conf: slot_list.conf

model_conf:
  model: built-in|KVNRMS
  table_conf:  # remove table_conf if you don't use kv-table.
  - name: title_embedding
    # load_dir: "kv_emb_all"
    dim: 300
    zero_indexes: [0]
    initializer:
      type: uniform_random
      attrs:
        min: -0.01
        max: 0.01

  n_heads: 20
  head_dim: 20
  atten_hidden_dim: 200
  hist_len: 5
  title_len: 3
  cand_len: 2
  # place: cuda

optimizer_conf:
  optimizer: adam
  lr: 1e-4
  # l2_decay: 0.001  # NRMS disables l2_decay

run_conf:
  runner: train_eval_save
  epoch: 1
  train_print_period: 1
  train_debug: False
  valid_print_period: 1
  valid_debug: False
  logdir: "expr/log_nrms"
  load_dir: ""
  save_dir: "save_dir"
  save_mode: "save_pipeline_model"

data_conf:
  train_dataset:
    mode: InMemoryDataset
    shuffle: True
    thread: 8
    batch_size: 128
    dirs:
    - train-data
  val_dataset:
    mode: InMemoryDataset
    shuffle: False
    thread: 1
    batch_size: 128
    dirs:
    - train-data
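NRMS encodes each title of title_len tokens with multi-head self-attention; with n_heads: 20 and head_dim: 20 each token presumably ends up 20 × 20 = 400-dimensional after attention, since every head projects the 300-dim input embedding down to head_dim. The core per-head operation is scaled dot-product attention; a single-query, pure-Python sketch (toy dimensions, illustrative only):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query over a token sequence."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Three toy 2-dim token vectors standing in for a title of title_len tokens
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(tokens[0], tokens, tokens))  # a weighted average of the tokens
```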
7. Training Log
[test@dev-baidu-01 logs]$ cat train.log
2024-01-05 17:44:27,762-INFO: local: set random seed as 1000
2024-01-05 17:44:27,763-INFO: local: vdl records will be write in vdl_local
2024-01-05 17:44:27,864-INFO: local: running on host: ae84503c2387, ip: 192.168.42.5, dir: /home/work/iptv/mmoe/model_5000w
2024-01-05 17:44:27,865-INFO: local: init running config done
2024-01-05 17:44:28,478-INFO: local: Loading checkpoint with only incremental table: run_conf.load_dir is empty
2024-01-05 19:15:46,795-INFO: local: init dataset with InMemoryDataset, data size: 53466444, load in memory using: 5437.415941, shuffle using: 40.845384
2024-01-05 19:15:56,411-INFO: local: init dataset with InMemoryDataset, data size: 50000, load in memory using: 9.547343, shuffle using: 0.060682
2024-01-05 19:15:56,411-INFO: local: init model and reader done and start running epochs
2024-01-05 19:37:43,719-INFO: local: {"time": "2024-01-05 19:37:43", "run_type": "train", "batch_id": 200000, task 0 auc[1]:[0.749406], task 1 auc[1]:[0.944328]}
2024-01-05 20:04:17,671-INFO: local: {"time": "2024-01-05 20:04:17", "run_type": "train", "batch_id": 400000, task 0 auc[1]:[0.763899], task 1 auc[1]:[0.964906]}
2024-01-05 20:25:09,876-INFO: local: {"time": "2024-01-05 20:25:09", "run_type": "train", "batch_id": 600000, task 0 auc[1]:[0.772776], task 1 auc[1]:[0.973489]}
2024-01-05 20:46:00,475-INFO: local: {"time": "2024-01-05 20:46:00", "run_type": "train", "batch_id": 800000, task 0 auc[1]:[0.779014], task 1 auc[1]:[0.977674]}
2024-01-05 20:49:43,556-INFO: local: finish training epoch 0 in 5627.152987s, metrics: {"task 0 auc":0.779906,"task 1 auc":0.978272}
2024-01-05 20:49:43,567-INFO: local: Saving checkpoint with save_dir/epoch_0
2024-01-05 20:49:43,578-INFO: local: save serving kv model
2024-01-05 20:49:43,589-INFO: local: Dumped tag_embedding, total feature number 865861, cost time 0.032322
2024-01-05 20:49:43,806-INFO: local: Dumped tag_embedding, total feature number 865861, cost time 0.05661
2024-01-05 20:49:43,906-INFO: local: finish saving epoch 0 save model at: save_dir/epoch_0
2024-01-05 20:49:44,207-INFO: local: {"time": "2024-01-05 20:49:44", "run_type": "valid", "batch_id": 10000, task 0 auc[1]:[0.783166], task 1 auc[1]:[0]}
2024-01-05 20:49:44,408-INFO: local: {"time": "2024-01-05 20:49:44", "run_type": "valid", "batch_id": 20000, task 0 auc[1]:[0.781714], task 1 auc[1]:[1]}
2024-01-05 20:49:44,709-INFO: local: {"time": "2024-01-05 20:49:44", "run_type": "valid", "batch_id": 30000, task 0 auc[1]:[0.786361], task 1 auc[1]:[1]}
2024-01-05 20:49:44,910-INFO: local: {"time": "2024-01-05 20:49:44", "run_type": "valid", "batch_id": 40000, task 0 auc[1]:[0.786127], task 1 auc[1]:[0.999788]}
2024-01-05 20:49:45,111-INFO: local: {"time": "2024-01-05 20:49:45", "run_type": "valid", "batch_id": 50000, task 0 auc[1]:[0.784533], task 1 auc[1]:[0.999859]}
2024-01-05 20:49:45,311-INFO: local: {"time": "2024-01-05 20:49:45", "run_type": "valid", "batch_id": 60000, task 0 auc[1]:[0.780083], task 1 auc[1]:[0.999906]}
2024-01-05 20:49:45,512-INFO: local: {"time": "2024-01-05 20:49:45", "run_type": "valid", "batch_id": 70000, task 0 auc[1]:[0.779144], task 1 auc[1]:[0.999907]}
2024-01-05 20:49:46,308-INFO: local: finish evaluating epoch 0 in 1.805703s, metrics: {"task 0 auc":0.779955,"task 1 auc":0.999904}
2024-01-05 20:49:46,309-INFO: local: epoch: 0 start_at: Fri Jan 5 19:15:56 2024 end_at: Fri Jan 5 20:49:45 2024
2024-01-05 20:50:25,716-INFO: local: all running finished
This log file (train.log) records the details of a model training run. The key information, in order:

Initialization:
- The random seed is set (set random seed as 1000), which helps reproducibility.
- The VisualDL (a visualization tool) record path is set (vdl_local).
- The run host (host: ae84503c2387, ip: 192.168.42.5) and working directory (/home/work/iptv/mmoe/model_5000w) are logged.
- Run configuration initialization completes (init running config done).

Dataset loading:
- Two datasets are loaded: one of 53,466,444 samples (presumably the training set) and one of 50,000 samples (presumably the validation or test set). Both are loaded into memory, and the load and shuffle times are logged.

Training:
- After the model and reader are initialized, the training epochs begin.
- AUC (Area Under the Curve, a standard classification metric) is logged every 200,000 batch_id units, separately for the two tasks (task 0 and task 1).
- The first epoch (epoch 0) completes, with the total training time logged (5627.152987 s).

Saving and evaluation:
- After training, the epoch-0 model is saved to save_dir/epoch_0.
- The KV model needed for serving is saved.
- The dump of the feature embedding table (tag_embedding) is logged, including the feature count and time taken.
- The model is evaluated, with per-period AUC values on the validation set and the overall AUC logged.

Completion:
- The epoch's start and end times are logged.
- The run finishes (all running finished).
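The per-period metric lines are JSON-like but not valid JSON (the task keys are unquoted), so a small regex is a practical way to extract the AUC values; a sketch matching the format shown in the log above:

```python
import re

# A metric line in the format from the log above (values copied from it)
LINE = ('2024-01-05 19:37:43,719-INFO: local: {"time": "2024-01-05 19:37:43", '
        '"run_type": "train", "batch_id": 200000, '
        'task 0 auc[1]:[0.749406], task 1 auc[1]:[0.944328]}')

def parse_aucs(line):
    """Extract {task_index: auc} from one metric line."""
    return {int(t): float(a)
            for t, a in re.findall(r"task (\d+) auc\[1\]:\[([0-9.]+)\]", line)}

print(parse_aucs(LINE))  # {0: 0.749406, 1: 0.944328}
```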
