Parameter Reference

Last updated: 2024-12-10

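Built-in quantization tool parameters

Parameter	Description	Type	Required	Default	Allowed values
-i	Input path of the original model weights	str	Yes
-o	Output path for the quantized model weights	str	Yes
-quant_type	Quantization algorithm	str	Yes		weight_only_int8, smooth_quant, awq, gptq
-tp	Number of GPUs used for service deployment	int	Yes		1, 2, 4, 8
-t	Storage data type for the non-quantized parts	str	Yes		fp16, bf16
-sq	Smoother parameter for smooth-quant quantization	float	No		A decimal in the range [0, 1] (0.8 by default for llama, 0.75 for glm130)
-token	Path to the dataset token_ids needed by smooth-quant quantization, for special requirements	str	No		token_ids of a custom dataset, for example: --tokenids_path /XX/path_to/smooth_tokenids.txt; see Appendix 3 for the format
--multi-query-mode	Whether to use multi-query attention (for smooth-quant)	bool	No	False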
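
The constraints in the quantization table can be checked before the tool is launched. The following is a minimal sketch, not part of the tool itself: `build_quant_args` is a hypothetical helper name, and it only assembles the flags listed above. The executable name of the quantization tool is not given in this section, so the caller has to prepend it.

```python
from typing import List, Optional

# Allowed values copied from the parameter table above.
QUANT_TYPES = {"weight_only_int8", "smooth_quant", "awq", "gptq"}
TP_SIZES = {1, 2, 4, 8}
STORAGE_TYPES = {"fp16", "bf16"}


def build_quant_args(input_path: str, output_path: str, quant_type: str,
                     tp: int, storage_type: str,
                     sq: Optional[float] = None) -> List[str]:
    """Hypothetical helper: validate values and return the flag list."""
    if quant_type not in QUANT_TYPES:
        raise ValueError(f"unsupported -quant_type: {quant_type}")
    if tp not in TP_SIZES:
        raise ValueError(f"-tp must be one of {sorted(TP_SIZES)}, got {tp}")
    if storage_type not in STORAGE_TYPES:
        raise ValueError(f"-t must be fp16 or bf16, got {storage_type}")

    args = ["-i", input_path, "-o", output_path,
            "-quant_type", quant_type, "-tp", str(tp), "-t", storage_type]

    # -sq is optional; the table restricts it to the range [0, 1].
    if sq is not None:
        if not 0.0 <= sq <= 1.0:
            raise ValueError(f"-sq must be in [0, 1], got {sq}")
        args += ["-sq", str(sq)]
    return args


if __name__ == "__main__":
    # Placeholder paths; 0.8 is the llama default quoted in the table.
    print(build_quant_args("/path/to/model", "/path/to/output",
                           "smooth_quant", 2, "fp16", sq=0.8))
```

Rejecting an invalid `-tp` or `-t` value up front avoids starting a long quantization run that would fail on argument parsing.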

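Main server parameters

Parameter	Description	Type	Required	AIAK-specific	Default	Allowed values
--host	Host name	str	No	No
--port	Port number	int	No	No	8000
--lora-modules	LoRA module configuration, in 'name=path' format or JSON format	str	No	No
--chat-template	File path of the chat template, or a single-line template for the specified model	str	No	No
--model	Name or path of the huggingface model to use	str	Yes	No	facebook/opt-125m
--tokenizer	Name or path of the huggingface tokenizer to use; if not specified, the model name or path is used	str	No	No
--tokenizer-extension-path	Path to vllm_extension.py in a custom tokenizer	str	No	Yes
--trust-remote-code	Trust remote code from huggingface	bool	No	No	FALSE
--dtype	Data type for model weights and activations	str	No	No	auto	auto, half, float16, bfloat16, float, float32
--kv-cache-dtype	Data type for KV cache storage	str	No	No	auto	auto, fp8, fp8_e5m2, fp8_e4m3
--max-model-len	Context length of the model	int	No	No
--distributed-executor-backend	Backend for distributed serving. When multiple GPUs are used, it is set automatically to 'ray' if ray is installed, otherwise to 'mp' (multiprocessing)	str	No	No		ray, mp
--max-seq-len-to-capture	Maximum sequence length covered by CUDA graphs; longer sequences fall back to eager mode	int	No	No	8192
--num-scheduler-steps	Maximum number of forward steps per scheduler call	int	No	No	1
--pipeline-parallel-size, -pp	Number of pipeline-parallel stages	int	No	No	1
--tensor-parallel-size, -tp	Number of tensor-parallel replicas	int	No	No	1
--enable-prefix-caching	Enable automatic prefix caching	bool	No	No	FALSE
--gpu-memory-utilization	Fraction of GPU memory used by the model executor, from 0 to 1	float	No	No	0.9	0 to 1
--max-num-batched-tokens	Maximum number of batched tokens per iteration	int	No	No
--max-num-seqs	Maximum number of sequences per iteration	int	No	No	256
--quantization, -q	Method used to quantize the weights	str	No	No		awq, gptq, weight_only_int8, smooth_quant, None
--enforce-eager	Always use eager-mode PyTorch	bool	No	No	FALSE
--scheduler-delay-factor	Delay factor applied before scheduling the next prompt	float	No	No	0
--enable-chunked-prefill	If set, prefill requests can be chunked based on max_num_batched_tokens	bool	No	No
--disable-async-output-proc	Disable asynchronous output processing	bool	No	No	FALSE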
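
A launch command can be assembled from a dictionary of the server options above. The sketch below rests on two assumptions that this section does not confirm: that bool options are bare switches (present means enabled), and that the server is started from a command line. `LAUNCHER` is a hypothetical placeholder for the real entry point, and `build_server_args` is an illustrative helper name.

```python
import shlex
from typing import Dict, List, Union

Value = Union[str, int, float, bool]


def build_server_args(options: Dict[str, Value]) -> List[str]:
    """Turn {'tensor_parallel_size': 2} into ['--tensor-parallel-size', '2']."""
    args: List[str] = []
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            # Assumption: bool options are switches, emitted only when True.
            if value:
                args.append(flag)
        else:
            args += [flag, str(value)]
    return args


if __name__ == "__main__":
    options = {
        "model": "facebook/opt-125m",   # default value from the table
        "port": 8000,
        "tensor_parallel_size": 1,
        "gpu_memory_utilization": 0.9,
        "enable_prefix_caching": True,
        "enforce_eager": False,
    }
    # Hypothetical entry point; replace with the actual server command.
    LAUNCHER = ["python", "-m", "your_server_entrypoint"]
    print(shlex.join(LAUNCHER + build_server_args(options)))
```

Keeping the options in a dictionary makes it easy to store them in a config file and to diff two deployments.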

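Client request parameters

Parameter	Description	Type	Required	AIAK-specific	Default	Allowed values
model	ID of the model to use.	str	Yes	No	(none)
prompt	Prompt text used to generate the result.	str	Yes	No	(none)
best_of	Generates multiple results and returns the best one.	int	No	No	1	best_of must be greater than n
echo	Whether to include the prompt in the output.	bool	No	No	FALSE
frequency_penalty	Controls how strongly repeated content is penalized.	float	No	No	0	-2.0 to 2.0
logit_bias	Adjusts the probability that specific tokens appear.	map	No	No	null	-100 to 100
logprobs	Returns the most likely tokens and their log probabilities.	int	No	No	null	Integer, maximum 5
max_tokens	Maximum number of tokens to generate.	int	No	No	16	Integer
n	Number of results to generate for each prompt.	int	No	No	1	Integer
presence_penalty	Encourages the generation of new content.	float	No	No	0	-2.0 to 2.0
seed	Random seed for reproducible results.	int	No	No	null
stop	Token at which generation stops.	str	No	No	null
stream	Whether to return the result as a stream.	bool	No	No	FALSE
stream_options	Additional options for streaming responses.	object	No	No	null
suffix	Suffix appended after the result. Supported only by specific models.	str	No	No	null
temperature	Controls the randomness of generation.	float	No	No	1	0 to 2.0
top_p	Uses nucleus sampling, considering the tokens whose cumulative probability reaches top_p.	float	No	No	1	0 to 1.0
user	Unique identifier of the end user.	str	No	No	(none)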
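
The value ranges in the request table lend themselves to a client-side check before a request is sent. The following is a minimal sketch; `build_completion_request` is a hypothetical helper, not part of any SDK. It only builds and validates the JSON body. Sending it is left out because the endpoint URL is not given in this section.

```python
import json
from typing import Any, Dict

# Numeric ranges copied from the "Allowed values" column above.
RANGES = {
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
}


def build_completion_request(model: str, prompt: str,
                             **params: Any) -> Dict[str, Any]:
    """Hypothetical helper: validate optional fields and build the body."""
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            raise ValueError(f"{name} must be in [{low}, {high}]")

    logprobs = params.get("logprobs")
    if logprobs is not None and logprobs > 5:
        raise ValueError("logprobs must not exceed 5")

    # The table ties best_of to n; a best_of below n is invalid in any reading.
    if "best_of" in params and params["best_of"] < params.get("n", 1):
        raise ValueError("best_of must not be smaller than n")

    return {"model": model, "prompt": prompt, **params}


if __name__ == "__main__":
    body = build_completion_request(
        "facebook/opt-125m", "Hello", max_tokens=32, temperature=0.7)
    print(json.dumps(body, indent=2))
```

Only the required fields `model` and `prompt` are positional; everything else is optional and falls back to the server-side defaults listed in the table when omitted.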

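Client response parameters

Field	Description	Type	AIAK-specific
id	Unique identifier of the response, used to identify the result of this request.	str
choices	List of result choices the model generated for the input prompt.	array
created	Timestamp at which the response was generated (Unix time, in seconds).	int
model	Name of the model used to generate the result.	str
system_fingerprint	Fingerprint of the backend configuration; can be used together with the seed request parameter.	str
object	Object type, "text_completion".	str
usage	Usage statistics for the request.	object
completion_tokens	Number of tokens in the generated result.	int
prompt_tokens	Number of tokens in the prompt.	int
total_tokens	Total number of tokens used by the request (prompt + result).	int
completion_tokens_details	Detailed breakdown of the tokens used in the result.	object
prompt_tokens_details	Detailed breakdown of the tokens used in the input prompt.	object
sentence_length	Total length of the returned sentence (in characters).	int	Yes
cum_log_probs	Sum of the log probabilities (log_probs) of the generated tokens.	float	Yes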
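
A response can be sanity-checked against the field definitions above, in particular that `total_tokens` equals `prompt_tokens` plus `completion_tokens`. In the sketch below, `summarize_usage` is a hypothetical helper and every value in `SAMPLE` is made up for illustration. It assumes the three token counters sit inside the `usage` object; the AIAK-specific fields are left out because this section does not say where they appear in the response.

```python
import json
from typing import Any, Dict

# Illustrative response only; all values are invented for this example.
SAMPLE = """
{
  "id": "cmpl-example",
  "object": "text_completion",
  "created": 1733788800,
  "model": "facebook/opt-125m",
  "choices": [{}],
  "usage": {"prompt_tokens": 5, "completion_tokens": 16, "total_tokens": 21}
}
"""


def summarize_usage(response: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: check the response and return key figures."""
    if response.get("object") != "text_completion":
        raise ValueError("unexpected object type")

    usage = response["usage"]
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]
    # total_tokens is defined in the table as prompt + result.
    if usage["total_tokens"] != prompt + completion:
        raise ValueError("total_tokens does not match prompt + completion")

    return {
        "id": response["id"],
        "model": response["model"],
        "num_choices": len(response["choices"]),
        "prompt_tokens": prompt,
        "completion_tokens": completion,
    }


if __name__ == "__main__":
    print(summarize_usage(json.loads(SAMPLE)))
```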

Monitoring metrics

Metric	Description	Metric name
Total requests (Total)	Count of successfully received requests	vllm:request_total
Average request processing latency	End-to-end request latency in seconds	vllm:e2e_request_latency_seconds
Total successfully processed requests (Success)	Count of successfully processed requests	vllm:request_process_success_total
Inference request execution time	Inference compute infer duration in seconds	vllm:inference_compute_infer_duration
Total failed requests (Failed)	Count of failed processed requests	vllm:request_process_fail_total
Time to first token (seconds)	Time to first token in seconds	vllm:time_to_first_token_seconds
Time per output token	Time per output token in seconds	vllm:time_per_output_token_seconds
Input tokens processed	Prefill tokens processed	vllm:request_prompt_tokens
Generated tokens processed	Generation tokens processed	vllm:request_generation_tokens
Number of output sequences returned	The n request parameter	vllm:request_params_n
Cumulative number of preemptions from the engine	Cumulative number of preemptions from the engine	vllm:num_preemptions_total
Number of input tokens	Number of prefill tokens processed	vllm:prompt_tokens_total
Number of generated output tokens	Number of generation tokens processed	vllm:generation_tokens_total
Total successfully received requests	Count of successfully received requests	vllm:request_success_total
Total LoRA model requests	Total number of LoRA model requests	vllm:lora_count
Total request time for LoRA models	Total request time for the LoRA model	vllm:lora_time_e2e_requests
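
Counters such as `vllm:request_process_success_total` and `vllm:request_process_fail_total` can be combined into a failure ratio. The sketch below assumes the metrics are exported in the Prometheus text exposition format without trailing timestamps, which this section does not state. The sample text, including the `model_name` label, is invented for illustration, and `parse_metrics` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict

# Illustrative scrape output; values and the label name are made up.
SAMPLE = """\
# HELP vllm:request_process_success_total Count of successfully processed requests
# TYPE vllm:request_process_success_total counter
vllm:request_process_success_total{model_name="demo"} 95.0
vllm:request_process_fail_total{model_name="demo"} 5.0
vllm:prompt_tokens_total{model_name="demo"} 1200.0
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comments."""
    totals: Dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp assumed).
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] += float(value)
    return dict(totals)


def failure_ratio(metrics: Dict[str, float]) -> float:
    """Failed requests divided by all processed requests (0.0 if none)."""
    ok = metrics.get("vllm:request_process_success_total", 0.0)
    failed = metrics.get("vllm:request_process_fail_total", 0.0)
    processed = ok + failed
    return failed / processed if processed else 0.0


if __name__ == "__main__":
    print(failure_ratio(parse_metrics(SAMPLE)))
```

Summing over labels gives a service-wide figure; keeping the label part of each line instead would give one ratio per label set.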
