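Inference Parameter Reference

Updated: 2024-09-26

This document describes the parameters supported by AIAK-inference. Read it before you use AIAK-inference to accelerate model inference.

Quantization parameters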

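Parameter	Type	Required	Default	Allowed values	Description
-i	str	Yes			Input path of the original model weights
-o	str	Yes			Output path for the quantized model weights
-quant_type	str	Yes		weight_only_int8, smooth_quant, awq, gptq, squeezellm	Quantization algorithm
-tp	int	Yes		1, 2, 4, 8	Number of GPUs used for service deployment
-t	str	Yes		fp16, bf16	Storage type for the non-quantized parts of the model
-sq	float	No		Decimal in [0, 1] (default 0.8 for llama, 0.75 for glm130)	Smoother parameter for smooth-quant quantization
-token	str	No			Path to the token_ids needed by smooth-quant quantization; set only for special requirements
--multi-query-mode	bool	No	False		Whether to use multi-query attention (for smooth-quant)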
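
The constraints in the table above can be checked before a quantization job is launched. The following is a minimal sketch, not part of AIAK-inference: `build_quant_args` is a hypothetical helper, the paths in the usage note are placeholders, and the name of the quantization entry-point script is not given in this document, so the function returns only the flag list.

```python
from typing import List, Optional

# Allowed values copied from the quantization parameter table.
QUANT_TYPES = {"weight_only_int8", "smooth_quant", "awq", "gptq", "squeezellm"}
TP_SIZES = {1, 2, 4, 8}
STORAGE_TYPES = {"fp16", "bf16"}


def build_quant_args(input_path: str, output_path: str, quant_type: str,
                     tp: int, storage_type: str,
                     smoother: Optional[float] = None,
                     token_ids_path: Optional[str] = None,
                     multi_query_mode: bool = False) -> List[str]:
    """Validate values against the table and return the flag list (hypothetical helper)."""
    if quant_type not in QUANT_TYPES:
        raise ValueError(f"unsupported -quant_type: {quant_type}")
    if tp not in TP_SIZES:
        raise ValueError(f"-tp must be one of {sorted(TP_SIZES)}, got {tp}")
    if storage_type not in STORAGE_TYPES:
        raise ValueError(f"-t must be fp16 or bf16, got {storage_type}")

    # The five required flags, in table order.
    args = ["-i", input_path, "-o", output_path,
            "-quant_type", quant_type, "-tp", str(tp), "-t", storage_type]

    if smoother is not None:
        if not 0.0 <= smoother <= 1.0:
            raise ValueError("-sq must lie in [0, 1]")
        args += ["-sq", str(smoother)]
    if token_ids_path is not None:
        args += ["-token", token_ids_path]
    if multi_query_mode:
        # Assumption: the flag is a bare switch; the table only states its type (bool).
        args.append("--multi-query-mode")
    return args
```

For example, `build_quant_args("/models/llama-2-7b-hf", "/models/llama-2-7b-sq", "smooth_quant", 2, "fp16", smoother=0.8)` yields the flags for a two-GPU smooth-quant run; prepend the actual quantization command from your AIAK-inference installation.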

Models supported by each algorithm

Algorithm	Supported models
weight_only_int8	llama-7b-hf, llama-13b-hf, llama-65b-hf, llama-2-7b-hf, llama-2-13b-hf, llama-2-70b-hf, baichuan1-7b, baichuan1-13b
smooth_quant	llama-7b-hf, llama-13b-hf, llama-65b-hf, llama-2-7b-hf, llama-2-13b-hf, llama-2-70b-hf, baichuan7b, baichuan13b, qwen1.5-14b, qwen1.5-72b (Note: for the non-quantized data type, bf16 is currently converted to fp16)
gptq	qwen1.5-14b, qwen1.5-72b, qwen-14b, qwen-72b
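
The support table can be encoded as a lookup so that an unsupported algorithm and model pair is rejected early. This is an illustrative sketch only: `SUPPORTED_MODELS`, `is_supported`, and `effective_storage_type` are hypothetical names, the model names are transcribed as the table spells them, and awq and squeezellm are left out because the table does not list models for them.

```python
# Support matrix transcribed from the table; not an official API.
_LLAMA = {"llama-7b-hf", "llama-13b-hf", "llama-65b-hf",
          "llama-2-7b-hf", "llama-2-13b-hf", "llama-2-70b-hf"}

SUPPORTED_MODELS = {
    "weight_only_int8": _LLAMA | {"baichuan1-7b", "baichuan1-13b"},
    "smooth_quant": _LLAMA | {"baichuan7b", "baichuan13b",
                              "qwen1.5-14b", "qwen1.5-72b"},
    "gptq": {"qwen1.5-14b", "qwen1.5-72b", "qwen-14b", "qwen-72b"},
}


def is_supported(quant_type: str, model: str) -> bool:
    """Return True if the table lists `model` under `quant_type`."""
    return model in SUPPORTED_MODELS.get(quant_type, set())


def effective_storage_type(quant_type: str, storage_type: str) -> str:
    """Apply the table's note: smooth_quant currently converts bf16 to fp16."""
    if quant_type == "smooth_quant" and storage_type == "bf16":
        return "fp16"
    return storage_type
```

A caller might run `is_supported("gptq", "qwen-14b")` before starting a job, and use `effective_storage_type` to learn which type the non-quantized weights will actually be stored in.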

Inference parameters

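Parameter	Type	Required	Default	Allowed values	Description
--model_name	str	Yes		llama, llama2, chatglm2, baichuan, glm, qwen1, qwen1.5, InternLM2-20B-chat	Name used to identify the model
--num_gpus	int	Yes	1		Number of GPUs to use
--ckpt_path		Yes			Path to the checkpoint
--data_type	str	Yes		fp16, fp32	Data type
--batch_size	int	No	8		Batch size
--tokenizer_path	str	No	ckpt_path		Path to the tokenizer
--extension_path	str	No	/dev/		Extension path (the default value means no extension is used)
--quant_mode	str	No		weight_only_int8, smooth_quant, awq, gptq, squeezellm	Quantization mode; omit it to run without quantization
--enforce_eager	bool	No	True		Forces eager-mode PyTorch. If set to False, CUDA graph and eager mode are used together.
--gpu_memory_utilization	float	No	0.99		Fraction of GPU memory used during inference; if CUDA graph is enabled, this value may need to be lowered.
--grpc_port	int	No	8001		gRPC port of the Triton server
--http_port	int	No	8000		HTTP port of the Triton server
--metrics_port	int	No	8002		Metrics port of the Triton server
--log_verbose	bool	No	False		Enables verbose logging when set to True
--kv_cache_dtype	str	No	auto	auto, fp8_e5m2	KV cache data type, used to improve inference efficiency. The default auto leaves this feature off; set fp8_e5m2 to turn it on.
--no_prompts	bool	No	False		Whether to exclude the input prompt from the output. With the default False, the output includes the input prompt; set to True to filter it out.
--enable_decouple	bool	No	True		Enables streaming responses
--task_type	str	Yes	causal_lm	causal_lm, sequence_classification	Model type: generative (default) or discriminative
--spec_dec_type	str	No	none	none, medusa	Speculative sampling mode; only medusa and none are currently supported
--draft_model	str	No			Path to the model used for speculative sampling (for Medusa, the path to the trained model); required when --spec_dec_type is not none
--propose_cnt	int	No			Number of speculative samples. A larger value raises the sampling hit rate but lengthens each computation step, so tune it to your workload. In Medusa mode this is a list of three ints, with a recommended value of `1,3,4`. Required when --spec_dec_type is not none.
--max_num_seqs	int	No	none	Positive integer or none	Maximum number of sequences processed in one iteration
--max_num_batched_tokens	int	No	none	Positive integer or none	Maximum number of tokens processed in one iteration
--max_model_len	int	No	none	Positive integer or none	Maximum sequence length (prompt plus generated text)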
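
Several rows above depend on one another; in particular, `--draft_model` and `--propose_cnt` become required once `--spec_dec_type` is not none. The sketch below shows one way to enforce this before launching a service. It is not part of AIAK-inference: `build_inference_args` is a hypothetical helper, the launch command itself is not named in this document, and passing every extra option as `--flag value` is an assumption, since the table does not say how bool flags are spelled on the command line.

```python
from typing import Dict, List, Optional, Sequence

# Allowed values copied from the inference parameter table.
MODEL_NAMES = {"llama", "llama2", "chatglm2", "baichuan", "glm",
               "qwen1", "qwen1.5", "InternLM2-20B-chat"}
DATA_TYPES = {"fp16", "fp32"}
TASK_TYPES = {"causal_lm", "sequence_classification"}
SPEC_DEC_TYPES = {"none", "medusa"}


def build_inference_args(model_name: str, num_gpus: int, ckpt_path: str,
                         data_type: str, task_type: str = "causal_lm",
                         spec_dec_type: str = "none",
                         draft_model: Optional[str] = None,
                         propose_cnt: Optional[Sequence[int]] = None,
                         extra: Optional[Dict[str, object]] = None) -> List[str]:
    """Validate values against the table and return the flag list (hypothetical helper)."""
    for flag, value, allowed in (
        ("--model_name", model_name, MODEL_NAMES),
        ("--data_type", data_type, DATA_TYPES),
        ("--task_type", task_type, TASK_TYPES),
        ("--spec_dec_type", spec_dec_type, SPEC_DEC_TYPES),
    ):
        if value not in allowed:
            raise ValueError(f"{flag}: unsupported value {value!r}")
    if num_gpus < 1:
        raise ValueError("--num_gpus must be a positive integer")

    args = ["--model_name", model_name, "--num_gpus", str(num_gpus),
            "--ckpt_path", ckpt_path, "--data_type", data_type,
            "--task_type", task_type, "--spec_dec_type", spec_dec_type]

    if spec_dec_type != "none":
        # Both options are required whenever speculative sampling is on.
        if draft_model is None or propose_cnt is None:
            raise ValueError("--draft_model and --propose_cnt are required "
                             "when --spec_dec_type is not none")
        if spec_dec_type == "medusa" and len(propose_cnt) != 3:
            raise ValueError("Medusa expects three ints for --propose_cnt")
        # Comma-separated, matching the recommended value `1,3,4`.
        args += ["--draft_model", draft_model,
                 "--propose_cnt", ",".join(str(n) for n in propose_cnt)]

    # Assumption: optional settings are passed as `--flag value` pairs.
    for flag, value in (extra or {}).items():
        args += [flag, str(value)]
    return args
```

For instance, `build_inference_args("llama2", 2, "/ckpt/llama-2-13b-hf", "fp16", spec_dec_type="medusa", draft_model="/ckpt/medusa", propose_cnt=[1, 3, 4])` produces a Medusa configuration with the recommended propose counts; the checkpoint paths are placeholders.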
