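Token Counting for Open-Source Models
Last updated: 2025-05-14
How to get the token length
Qianfan provides a token calculator. After logging in to the token calculator page, you can get the token length of text and images.
Token counting method for open-source models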
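- Taking deepseek-v3 as an example, download the two tokenizer-related files for the model from Hugging Face:
- tokenizer.json
- tokenizer_config.json
- Create a file named model_tokenizer.py in the same directory as the two downloaded files.
- The code of model_tokenizer.py is as follows: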
Python
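# Install the dependency: pip3 install transformers
# Run the script:         python3 model_tokenizer.py
import transformers

# Directory containing tokenizer.json and tokenizer_config.json
chat_tokenizer_dir = "./"

tokenizer = transformers.AutoTokenizer.from_pretrained(
    chat_tokenizer_dir, trust_remote_code=True
)

# Sample text to tokenize (the original Chinese title of this document)
text = "开源模型token计算说明"

# encode() returns the list of token ids for the text
result = tokenizer.encode(text)
print("ids:", result)
count = len(result)
print("token count:", count)
- Running model_tokenizer.py prints the following output:
Plain Text
ids: [83649, 8842, 33912, 4339, 6977]
token count: 5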
- This shows that "开源模型token计算说明" contains 5 tokens. The output also lists the id of each token, and you can look up an id in tokenizer.json to find the token string it corresponds to.
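The id lookup in tokenizer.json can be sketched with the standard library alone. This is a minimal sketch, assuming the usual Hugging Face tokenizer.json layout, where `model.vocab` maps token strings to ids and `added_tokens` lists special tokens. The helper names `build_id_to_token` and `load_id_to_token` and the sample vocabulary are hypothetical and only serve as illustration.

```python
import json
from pathlib import Path


def build_id_to_token(tokenizer_data: dict) -> dict:
    """Invert the vocabulary of a parsed tokenizer.json into {id: token}."""
    id_to_token = {}
    # Assumed layout: "model" -> "vocab" maps token string -> id
    for token, token_id in tokenizer_data.get("model", {}).get("vocab", {}).items():
        id_to_token[token_id] = token
    # Assumed layout: "added_tokens" is a list of {"id": ..., "content": ...}
    for entry in tokenizer_data.get("added_tokens", []):
        id_to_token[entry["id"]] = entry["content"]
    return id_to_token


def load_id_to_token(path: str = "tokenizer.json") -> dict:
    """Read tokenizer.json from disk and build the {id: token} mapping."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return build_id_to_token(data)


# Tiny made-up vocabulary for demonstration; this is not real deepseek-v3 data
sample = {
    "model": {"vocab": {"hello": 0, "world": 1}},
    "added_tokens": [{"id": 2, "content": "<|end|>"}],
}
mapping = build_id_to_token(sample)
print([mapping[i] for i in [0, 1, 2]])
```

With the real file, `load_id_to_token("tokenizer.json")` would return the full mapping. Many byte-level BPE vocabularies store tokens in a byte-mapped form, so Chinese characters may not appear as readable text there; calling `tokenizer.decode([token_id])` on the loaded tokenizer is usually easier to read.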
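Token counting for complex inputs
- When the input contains a multi-turn conversation as well as tool definitions, computing its token length requires the chat_template defined in tokenizer_config.json.
- Taking qwen3-8b as an example, the input is as follows: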
JSON
{
  "model": "qwen3-8b",
  "messages": [
    {
      "role": "user",
      "content": "查一下上海和北京现在的天气"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "天气查询工具",
        "parameters": {
          "properties": {
            "location": {
              "description": "地理位置,精确到区县级别",
              "type": "string"
            },
            "time": {
              "description": "时间,格式为YYYY-MM-DD",
              "type": "string"
            }
          },
          "type": "object"
        }
      }
    }
  ],
  "stream": false,
  "enable_thinking": false,
  "tool_choice": "auto",
  "tool_options": {"thoughts_output": true}
}
- After the chat_template is applied, the input is rendered into the following text:
Plain Text
<|im_start|>system
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "get_current_weather", "description": "天气查询工具", "parameters": {"properties": {"location": {"description": "地理位置,精确到区县级别", "type": "string"}, "time": {"description": "时间,格式为YYYY-MM-DD", "type": "string"}}, "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
查一下上海和北京现在的天气<|im_end|>
<|im_start|>assistant
<think>

</think>
- The code to compute the token length of the text above is as follows:
Python
# Install the dependency: pip3 install transformers
# Run the script:         python3 model_tokenizer.py
import transformers

# Directory containing the qwen3-8b tokenizer.json and tokenizer_config.json
chat_tokenizer_dir = "./"

tokenizer = transformers.AutoTokenizer.from_pretrained(
    chat_tokenizer_dir, trust_remote_code=True
)

# Prompt text exactly as rendered by the chat_template
text = """<|im_start|>system
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "get_current_weather", "description": "天气查询工具", "parameters": {"properties": {"location": {"description": "地理位置,精确到区县级别", "type": "string"}, "time": {"description": "时间,格式为YYYY-MM-DD", "type": "string"}}, "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
查一下上海和北京现在的天气<|im_end|>
<|im_start|>assistant
<think>

</think>"""

# encode() returns the list of token ids for the rendered prompt
result = tokenizer.encode(text)

print("ids:", result)
count = len(result)
print("token count:", count)
- Running model_tokenizer.py prints the following output:
Plain Text
ids: [151644, 8948, 198, 2, 13852, 271, 2610, 1231, 1618, 825, 476, 803, 5746, 311, 7789, 448, 279, 1196, 3239, 382, 2610, 525, 3897, 448, 729, 32628, 2878, 366, 15918, 1472, 15918, 29, 11874, 9492, 510, 27, 15918, 397, 4913, 1313, 788, 330, 1688, 497, 330, 1688, 788, 5212, 606, 788, 330, 455, 11080, 69364, 497, 330, 4684, 788, 330, 104307, 51154, 102011, 497, 330, 13786, 788, 5212, 13193, 788, 5212, 2527, 788, 5212, 4684, 788, 330, 111692, 3837, 108639, 26939, 23836, 24342, 105972, 497, 330, 1313, 788, 330, 917, 14345, 330, 1678, 788, 5212, 4684, 788, 330, 20450, 3837, 68805, 17714, 28189, 18506, 40175, 497, 330, 1313, 788, 330, 917, 9207, 2137, 330, 1313, 788, 330, 1700, 30975, 532, 522, 15918, 1339, 2461, 1817, 729, 1618, 11, 470, 264, 2951, 1633, 448, 729, 829, 323, 5977, 2878, 220, 151657, 151658, 11874, 9492, 510, 151657, 198, 4913, 606, 788, 366, 1688, 11494, 8066, 330, 16370, 788, 366, 2116, 56080, 40432, 31296, 151658, 151645, 198, 151644, 872, 198, 32876, 100158, 100633, 33108, 68990, 104718, 104307, 151645, 198, 151644, 77091, 198, 151667, 271, 151668]
token count: 181
- The token length of the user input is 181, which matches the input token length returned in the model's inference result.
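Instead of copying the rendered prompt by hand, the rendering step can also be delegated to the tokenizer. The following is a minimal sketch, assuming a transformers version whose `apply_chat_template` accepts `tools` and forwards extra keyword arguments (such as `enable_thinking` for qwen3) to the template. The helper names `load_tokenizer` and `count_prompt_tokens` are hypothetical, and whether the rendered text matches the one shown above character for character depends on the template and library version.

```python
def load_tokenizer(tokenizer_dir: str = "./"):
    """Load the tokenizer from a directory holding tokenizer.json and tokenizer_config.json."""
    # Imported lazily so count_prompt_tokens works with any tokenizer-like object
    import transformers

    return transformers.AutoTokenizer.from_pretrained(
        tokenizer_dir, trust_remote_code=True
    )


def count_prompt_tokens(tokenizer, messages, tools=None, **template_kwargs):
    """Render messages and tools with the chat_template, then count the tokens."""
    # tokenize=False returns the rendered prompt as a string
    prompt = tokenizer.apply_chat_template(
        messages,
        tools=tools,
        tokenize=False,
        add_generation_prompt=True,
        **template_kwargs,
    )
    # Assumption: the template already emits every special token it needs,
    # so no extra special tokens are added while encoding
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    return prompt, len(ids)
```

For the qwen3-8b request above, a call could look like `count_prompt_tokens(load_tokenizer("./"), request["messages"], tools=request["tools"], enable_thinking=False)`, where `request` is the parsed JSON input. Printing the returned prompt makes it easy to compare against the rendered text before trusting the count.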