Inference Acceleration with AIAK
Overview
AIAK-Inference-LLM is the AI acceleration toolkit that Baidu AI Cloud provides as a best-practice solution for large-model inference on the Baige (百舸) heterogeneous computing platform. It helps model developers complete large-scale deep learning inference deployments efficiently and improves inference efficiency, delivering substantial performance gains over open-source vLLM.
Prerequisites
- The infrastructure consists of Baidu AI Cloud resources (a Baidu Cloud CCE cluster and supporting resources must be purchased)
- The hardware in the runtime environment must match the chips supported by the acceleration image
- The runtime operating system must be CentOS 7 or later
Package Description
Environment Requirements
Chip | H800, A800 |
---|---|
Operating system | CentOS 7+ |
Base Dependencies
Base image | ubuntu 22.04 |
---|---|
PyTorch | v2.1.2 |
CUDA | v12.1 |
Python | v3.10 |
Supported Models
Model series | Model name |
---|---|
Llama | 7B, 13B, 65B |
Llama2 | 7B, 13B, 70B |
ChatGLM | 6B |
ChatGLM2 | 6B |
ChatGLM3 | 6B |
GLM | 130B |
InternLM2 | 20B |
Baichuan2 | 7B, 13B |
Qwen | 7B, 14B, 72B |
Qwen1.5 | 0.5B, 1.8B, 4B, 7B, 14B, 72B |
Mixtral | 8x7B |
If you need inference for a model that is not yet supported, please submit a ticket or contact your account manager.
How to Use
Preparing the Image
Click 【获取地址】 (Get Address) on the page to obtain the download address of the packaged AIAK inference acceleration image.
# The following is a sample image address; in practice, use the address obtained via 【获取地址】 (Get Address)
registry.baidubce.com/aihc-aiak/aiak-inference-llm:ubuntu22.04-cu12.1-torch2.1.2-py310_1.3.2.4
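If you want to fetch the image in advance, you can pull it explicitly. The tag below is the sample address above; substitute the address you actually obtained:
# Pull the AIAK inference image ahead of time (use your actual image address)
docker pull registry.baidubce.com/aihc-aiak/aiak-inference-llm:ubuntu22.04-cu12.1-torch2.1.2-py310_1.3.2.4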
Starting a Test Instance with Docker
For production-scale use, deploy via standard Kubernetes instead.
# Start the container
docker run --gpus all -itd --name infer_test --shm-size=32768m --privileged --user=root --network host -v /PATH/TO/MODEL:/mnt/model registry.baidubce.com/aihc-aiak/aiak-inference-llm:ubuntu22.04-cu12.1-torch2.1.2-py310_1.3.3.2 /bin/bash
# Enter the container
docker exec -it infer_test bash
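Optionally, verify that the GPUs are visible inside the container before proceeding (this assumes the NVIDIA container runtime is configured on the host):
# Inside the container: list the visible GPUs
nvidia-smi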
Quantizing the Model with the Quantization Tool
Run the following from the container's command line. Quantization time depends on the model size and the number of GPUs, typically from roughly ten minutes to one hour.
# Change to the quantization tool directory
cd /workspace/aiak-model-quant-tool/
# weight_only_int8 quantization
python3 model_quantization.py -i /input_path/ -o ./output_path/ -tp 1 -quant_type weight_only_int8 -t fp16
# smooth-quant quantization
python3 model_quantization.py -i /input_path/ -o ./output_path/ -tp 1 -quant_type smooth_quant -t fp16 -sq 0.75
# Usage example: python3 model_quantization.py -i /input_path/ -o /output_path/ -quant_type weight_only_int8 -tp 2
Note: see the "Parameter Reference - Quantization Parameters" section for detailed parameter descriptions.
- Sample quantization output
=============== Argument ===============
out_dir: /Qwen-14B/
in_file: /Qwen-14B/
tensor_parallelism: 2
model_quantization_type: weight_only_int8
multi_query_mode: False
========================================
INFO 12-01 08:30:18 model_quantization.py:35] quantization an LLM engine model_config:QWenConfig {
INFO 12-01 08:30:18 model_quantization.py:35] "architectures": [
INFO 12-01 08:30:18 model_quantization.py:35] "QWenLMHeadModel"
INFO 12-01 08:30:18 model_quantization.py:35] ],
INFO 12-01 08:30:18 model_quantization.py:35] "attn_dropout_prob": 0.0,
INFO 12-01 08:30:18 model_quantization.py:35] "auto_map": {
INFO 12-01 08:30:18 model_quantization.py:35] "AutoConfig": "configuration_qwen.QWenConfig",
INFO 12-01 08:30:18 model_quantization.py:35] "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
INFO 12-01 08:30:18 model_quantization.py:35] },
INFO 12-01 08:30:18 model_quantization.py:35] "bf16": false,
INFO 12-01 08:30:18 model_quantization.py:35] "emb_dropout_prob": 0.0,
INFO 12-01 08:30:18 model_quantization.py:35] "fp16": false,
INFO 12-01 08:30:18 model_quantization.py:35] "fp32": false,
INFO 12-01 08:30:18 model_quantization.py:35] "hidden_size": 5120,
INFO 12-01 08:30:18 model_quantization.py:35] "initializer_range": 0.02,
INFO 12-01 08:30:18 model_quantization.py:35] "intermediate_size": 27392,
INFO 12-01 08:30:18 model_quantization.py:35] "kv_channels": 128,
INFO 12-01 08:30:18 model_quantization.py:35] "layer_norm_epsilon": 1e-06,
INFO 12-01 08:30:18 model_quantization.py:35] "max_position_embeddings": 8192,
INFO 12-01 08:30:18 model_quantization.py:35] "model_type": "qwen",
INFO 12-01 08:30:18 model_quantization.py:35] "no_bias": true,
INFO 12-01 08:30:18 model_quantization.py:35] "num_attention_heads": 40,
INFO 12-01 08:30:18 model_quantization.py:35] "num_hidden_layers": 40,
INFO 12-01 08:30:18 model_quantization.py:35] "onnx_safe": null,
INFO 12-01 08:30:18 model_quantization.py:35] "rotary_emb_base": 10000,
INFO 12-01 08:30:18 model_quantization.py:35] "rotary_pct": 1.0,
INFO 12-01 08:30:18 model_quantization.py:35] "scale_attn_weights": true,
INFO 12-01 08:30:18 model_quantization.py:35] "seq_length": 2048,
INFO 12-01 08:30:18 model_quantization.py:35] "tie_word_embeddings": false,
INFO 12-01 08:30:18 model_quantization.py:35] "tokenizer_class": "QWenTokenizer",
INFO 12-01 08:30:18 model_quantization.py:35] "transformers_version": "4.34.0",
INFO 12-01 08:30:18 model_quantization.py:35] "use_cache": true,
INFO 12-01 08:30:18 model_quantization.py:35] "use_dynamic_ntk": true,
INFO 12-01 08:30:18 model_quantization.py:35] "use_flash_attn": "auto",
INFO 12-01 08:30:18 model_quantization.py:35] "use_logn_attn": true,
INFO 12-01 08:30:18 model_quantization.py:35] "vocab_size": 152064
INFO 12-01 08:30:18 model_quantization.py:35] }
INFO 12-01 08:30:18 model_quantization.py:35] quantization tp_size is: 2,
INFO 12-01 08:33:29 model_quantization.py:35] ==quantization= state_dict===key: transformer.h.0.attn.c_proj.qscale ==value.shape: torch.Size([2, 5120])
INFO 12-01 08:33:29 model_quantization.py:35] ==quantization= state_dict===key: transformer.h.0.attn.c_proj.qweight ==value.shape: torch.Size([5120, 1280])
INFO 12-01 08:33:29 model_quantization.py:35] ==quantization= state_dict===key: transformer.h.0.ln_1.weight ==value.shape: torch.Size([5120])
INFO 12-01 08:33:29 model_quantization.py:35] ==quantization= state_dict===key: transformer.h.0.ln_2.weight ==value.shape: torch.Size([5120])
INFO 12-01 08:33:29 model_quantization.py:35] ==quantization= state_dict===key: transformer.wte.weight ==value.shape: torch.Size([152064, 5120])
('json_quant_config: ', {'w_bit': 8})
……
INFO 12-01 08:33:54 model_quantization.py:35] Quantization model save path: /output_path/2-gpu
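As the last log line indicates, the quantized weights are written to a subdirectory named after the tensor-parallel degree. A quick sanity check might look like this (paths are illustrative):
# The quantized checkpoint produced with -tp 2 lands in a "2-gpu" subdirectory of the output path
ls /output_path/2-gpu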
Starting the Inference Service
# Example invocation of the startup script shipped in the image
bash run_triton_v2.sh --model_name=llama --ckpt_path=/ckpt_path/ --data_type=fp16 --quant_mode=weight_only_int8
Note: see the "Parameter Reference - Inference Parameters" section for detailed parameter descriptions. Only Hugging Face checkpoints are currently supported; Megatron checkpoints must first be converted to HF format.
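After the startup script finishes, you can optionally confirm that the server is ready before sending requests. The check below assumes the default HTTP port (8000) listed under "Inference Parameters" and Triton's standard readiness endpoint:
# Returns HTTP 200 when the Triton server is ready to serve requests
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready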
Test Requests
Streaming Requests
Start a client container from the image and use the client script to send gRPC streaming requests. Step 1: prepare the client script.
#!/usr/bin/env python3
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
import google.protobuf.json_format
import json
import multiprocessing as mp
import numpy as np
import time
import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient
from argparse import ArgumentParser
from collections.abc import Mapping
from functools import partial
from tritonclient.grpc.service_pb2 import ModelInferResponse
from tritonclient.utils import np_to_triton_dtype
def deep_update(source, overrides):
"""
Update a nested dictionary or similar mapping.
Modify ``source`` in place.
"""
for key, value in overrides.items():
if isinstance(value, Mapping) and value:
returned = deep_update(source.get(key, {}), value)
source[key] = returned
else:
source[key] = overrides[key]
return source
def parse_args():
parser = ArgumentParser()
parser.add_argument("request_file", nargs="?", default=None, metavar="request-file")
parser.add_argument("--params")
parser.add_argument("-t", "--test", action="store_true")
args = parser.parse_args()
return args
def generate_parameters(args):
DEFAULT_CONFIG = {
'protocol': 'http',
'url': None,
'model_name': 'fastertransformer',
'verbose': False,
'stream_api': False,
}
params = {'config': DEFAULT_CONFIG, 'request': []}
if args.request_file is not None:
with open(args.request_file) as f:
file_params = json.load(f)
deep_update(params, file_params)
args_params = json.loads(args.params) if args.params else {}
deep_update(params, args_params)
for index, value in enumerate(params['request']):
params['request'][index] = {
'name': value['name'],
'data': np.array(value['data'], dtype=value['dtype']),
}
if params['config']['url'] is None:
if params['config']['protocol'] == 'grpc':
params['config']['url'] = 'localhost:8001'
else:
params['config']['url'] = 'localhost:8000'
return params['config'], params['request']
def prepare_tensor(client, name, input):
t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
t.set_data_from_numpy(input)
return t
def stream_consumer(queue, test: bool):
start_time = None
tokens_before = np.array([], dtype=np.int32)
while True:
result = queue.get()
if result is None:
break
if isinstance(result, float):
start_time = result
continue
message = ModelInferResponse()
google.protobuf.json_format.Parse(json.dumps(result), message)
result = grpcclient.InferResult(message)
tokens = result.as_numpy("OUTPUT_0")
is_finished = result.as_numpy("is_finished")
string = [token.decode() for token in tokens]
print(repr(string[0])[1:-1])
if test:
assert len(tokens) == len(tokens_before) + 1
assert np.array_equal(tokens[:-1], tokens_before)
tokens_before = tokens
def stream_callback(queue, result, error):
if error:
queue.put(error)
else:
queue.put(result.get_response(as_json=True))
def main_stream(config, request, test: bool):
client_type = grpcclient
kwargs = {"verbose": config["verbose"]}
result_queue = mp.Queue()
consumer = mp.Process(target=stream_consumer, args=(result_queue, test))
consumer.start()
with grpcclient.InferenceServerClient(config['url'], verbose=config["verbose"]) as cl:
payload = [prepare_tensor(grpcclient, field['name'], field['data'])
for field in request]
cl.start_stream(callback=partial(stream_callback, result_queue))
result_queue.put(time.perf_counter())
cl.async_stream_infer(config['model_name'], payload)
result_queue.put(None)
consumer.join()
def main_sync(config, request):
is_http = config['protocol'] == 'http'
client_type = httpclient if is_http else grpcclient
kwargs = {"verbose": config["verbose"]}
if is_http:
kwargs["concurrency"] = 10
with client_type.InferenceServerClient(config['url'], **kwargs) as cl:
payload = [prepare_tensor(client_type, field['name'], field['data'])
for field in request]
result = cl.infer(config['model_name'], payload)
if is_http:
for output in result.get_response()['outputs']:
print("{}:\n{}\n".format(output['name'], result.as_numpy(output['name'])))
else:
for output in result.get_response().outputs:
print("{}:\n{}\n".format(output.name, result.as_numpy(output.name)))
if __name__ == "__main__":
args = parse_args()
config, request = generate_parameters(args)
if not config['stream_api']:
main_sync(config, request)
else:
main_stream(config, request, args.test)
Step 2: edit the request content in the sample_request_ensemble.json file.
{
"config": {
"model_name": "ensemble",
"protocol": "grpc",
"stream_api": true
},
"request": [
{
"name": "INPUT_0",
"data": [["Malika Louback believes her three engineering degrees make her"]],
"dtype": "object"
},
{
"name": "INPUT_1",
"data": [[1024]],
"dtype": "uint32"
},
{
"name": "beam_search_diversity_rate",
"data": [[0.0]],
"dtype": "float32"
},
{
"name": "temperature",
"data": [[0.0]],
"dtype": "float32"
},
{
"name": "len_penalty",
"data": [[1.0]],
"dtype": "float32"
},
{
"name": "repetition_penalty",
"data": [[1.0]],
"dtype": "float32"
},
{
"name": "random_seed",
"data": [[0]],
"dtype": "uint64"
},
{
"name": "is_return_log_probs",
"data": [[true]],
"dtype": "bool"
},
{
"name": "beam_width",
"data": [[1]],
"dtype": "uint32"
},
{
"name": "runtime_top_k",
"data": [[-1]],
"dtype": "int32"
},
{
"name": "runtime_top_p",
"data": [[1.0]],
"dtype": "float32"
},
{
"name": "start_id",
"data": [[1]],
"dtype": "uint32"
},
{
"name": "end_id",
"data": [[2002]],
"dtype": "uint32"
},
{
"name": "INPUT_2",
"data": [[""]],
"dtype": "object"
},
{
"name": "INPUT_3",
"data": [[""]],
"dtype": "object"
}
]
}
Here INPUT_0 is the request content, INPUT_1 is the maximum number of tokens the LLM should generate, and INPUT_3 is the stop words (end_words). Save the file after editing.
Step 3: send the request with python3 issue_request.py sample_request_ensemble.json.
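The client also accepts a --params override (see parse_args in the script above), which is merged on top of the request file; this is handy for pointing the same request at a different endpoint. The address below is illustrative:
# Override the gRPC endpoint without editing the request file (address is illustrative)
python3 issue_request.py sample_request_ensemble.json --params '{"config": {"url": "10.0.0.5:8001"}}'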
Parameter Reference
Quantization Parameters
Parameter | Type | Required | Default | Allowed values | Description |
---|---|---|---|---|---|
-i | str | Yes | | | Input path of the original model weights |
-o | str | Yes | | | Output path of the quantized model weights |
-quant_type | str | Yes | | weight_only_int8, smooth_quant, awq, gptq, squeezellm | Quantization algorithm |
-tp | int | Yes | | 1, 2, 4, 8 | Number of GPUs the service will be deployed on |
-t | str | Yes | | fp16, bf16 | Storage data type for the non-quantized parts |
-sq | float | No | | Decimal in [0, 1] (default 0.8 for llama, 0.75 for glm130) | Smoother parameter for smooth-quant quantization |
-token | str | No | | | Path to the token_ids required by smooth-quant quantization, for special needs |
--multi-query-mode | bool | No | False | | Whether to use multi-query attention (for smooth_quant) |
- Supported models per algorithm
Algorithm | Supported models |
---|---|
weight_only_int8 | llama-7b-hf, llama-13b-hf, llama-65b-hf, llama-2-7b-hf, llama-2-13b-hf, llama-2-70b-hf, baichuan1-7b/13b |
smooth_quant | llama-7b-hf, llama-13b-hf, llama-65b-hf, llama-2-7b-hf, llama-2-13b-hf, llama-2-70b-hf, baichuan7b, baichuan13b, qwen1.5-14b, qwen1.5-72b (note: for the non-quantized data type, bf16 is currently converted to fp16) |
gptq | qwen1.5-14b, qwen1.5-72b, qwen-14b, qwen-72b |
Inference Parameters
Parameter | Type | Required | Default | Allowed values | Description |
---|---|---|---|---|---|
--model_name | str | Yes | | llama, llama2, chatglm2, baichuan, glm, qwen1, qwen1.5, InternLM2-20B-chat | Name used to identify the model |
--num_gpus | int | Yes | 1 | | Number of GPUs to use |
--ckpt_path | | Yes | | | Path to the checkpoint |
--data_type | str | Yes | | fp16, fp32 | Data type |
--batch_size | int | No | 8 | | Batch size |
--tokenizer_path | str | No | ckpt_path | | Path to the tokenizer |
--extension_path | str | No | /dev/ | | Extension path (the default value means no extension) |
--quant_mode | str | No | | weight_only_int8, smooth_quant, awq, gptq, squeezellm | Quantization mode; quantization can be left disabled or enabled |
--enforce_eager | bool | No | True | | Force eager-mode PyTorch; defaults to True. If set to False, CUDA graphs are used together with eager mode |
--gpu_memory_utilization | float | No | 0.99 | | Fraction of GPU memory used during inference; if CUDA graphs are enabled, this fraction may need to be lowered |
--grpc_port | int | No | 8001 | | gRPC port of the Triton server |
--http_port | int | No | 8000 | | HTTP port of the Triton server |
--metrics_port | int | No | 8002 | | Metrics port of the Triton server |
--log_verbose | bool | No | False | | Enables verbose logging when set to true |
--kv_cache_dtype | str | No | auto | auto, fp8_e5m2 | KV cache data type, used to improve inference efficiency; defaults to auto (not enabled). Set to fp8_e5m2 to enable the FP8 KV cache |
--no_prompts | bool | No | False | | Whether the output includes the input prompt; defaults to False (prompt included). Set to True to strip the prompt from the output |
--enable_decouple | bool | No | True | | Enable streaming responses |
--task_type | str | Yes | causal_lm | causal_lm, sequence_classification | Model type: generative (default) or discriminative |
--spec_dec_type | str | No | none | none, medusa | Speculative-sampling mode; currently only medusa or none is supported, default none |
--draft_model | str | No | | | Path to the model used for speculative sampling (for Medusa, the trained model path); required when --spec_dec_type is not none |
--propose_cnt | int | No | | | Number of speculative tokens; larger values increase the hit rate but lengthen each step, so tune it for your workload. For Medusa this is a list of 3 ints, recommended value 1,3,4; required when --spec_dec_type is not none |
--max_num_seqs | int | No | none | Positive integer or none | Maximum number of sequences that can be processed in one iteration |
--max_num_batched_tokens | int | No | none | Positive integer or none | Maximum number of tokens that can be processed in one iteration |
--max_model_len | int | No | none | Positive integer or none | Maximum sequence length (prompt plus generated text) |
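As an illustration of how these options combine, a launch command for a weight_only_int8-quantized Qwen1.5 checkpoint on 2 GPUs with the FP8 KV cache enabled might look like the following; the model and paths are placeholders, not a verified configuration:
# Illustrative combination of the parameters above; adjust to your model and hardware
bash run_triton_v2.sh --model_name=qwen1.5 --ckpt_path=/ckpt_path/ --num_gpus=2 --data_type=fp16 --quant_mode=weight_only_int8 --kv_cache_dtype=fp8_e5m2 --max_model_len=4096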
Appendix
Appendix 1: Whisper Model Usage Example
"""
Example code to run OpenAI Whisper
"""
from vllm import SamplingParams, LLM
from transformers import WhisperProcessor
from evaluate import load
from tqdm import tqdm
from datasets import load_dataset
import sys
if len(sys.argv) != 4:
print(f'Usage: python3 {sys.argv[0]} /path/to/model /path/to/dataset /path/to/wer')
exit(0)
path_to_model = sys.argv[1]
path_to_dataset = sys.argv[2]
wer_path = sys.argv[3]
librispeech_test_clean = load_dataset(path_to_dataset, "clean", split="test")
processor = WhisperProcessor.from_pretrained(path_to_model)
llm = LLM(model=path_to_model, gpu_memory_utilization=0.4)
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)
references = []
results = []
batch_size = 32
batch_features = []
def run_samples(input_features, sampling_params, pbar):
"""
Run processing with batch
"""
results = []
outputs = llm.generate(input_features=input_features,
sampling_params=sampling_params,
use_tqdm=False)
for output in outputs:
generated_text = output.outputs[0].text
generated_text = processor.tokenizer._normalize(generated_text)
results.append(generated_text)
pbar.update(1)
return results
pbar = tqdm(desc='samples', total=len(librispeech_test_clean))
for batch in librispeech_test_clean:
audio = batch['audio']
# append reference
references.append(processor.tokenizer._normalize(batch['text']))
input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
input_features = input_features.half().cuda()
batch_features.append(input_features)
if len(batch_features) == batch_size:
# actual run batch
results += run_samples(batch_features, sampling_params, pbar)
batch_features = []
# process remaining batch
if len(batch_features) > 0:
results += run_samples(batch_features, sampling_params, pbar)
pbar.close()
print(f'processed sample: {len(references)}')
# print(f'results: {results}')
# print(f'referneces: {references}')
wer = load(wer_path)
print(f'wer: {100 * wer.compute(references=references, predictions=results):.2f} %')
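Assuming the script is saved as run_whisper.py (the filename is illustrative), it is invoked with the model path, dataset path, and WER metric path, matching the usage message it prints:
# Illustrative invocation of the Whisper example script
python3 run_whisper.py /path/to/whisper-model /path/to/librispeech-dataset /path/to/wer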
Appendix 2: Explanation of Selected sample_request_ensemble.json Fields
Name | Usage | Description |
---|---|---|
INPUT_0 | input_ids | The input ids (context) |
INPUT_1 | input_lengths | The maximum length of the output token ids |
INPUT_3 | stop_words_list | Optional. When model generates words in this list, it will stop the generation. An extension of stop id |
Note: all fields in sample_request_ensemble.json must use the same batch_size (a quick consistency check is sketched after the example below).
{
"config": {
"model_name": "ensemble",
"protocol": "grpc",
"stream_api": true
},
"request": [
{
"name": "INPUT_0",
"data": [["Tell me about your self"],
["Who are you"],
["Malika Louback believes her three engineering degrees make her"],
["Racer's hurricane in 1837 was named after"],
["Jason Robertson is only the second Filipino American to"],
["The 1886 song of thanks \"Ein Danklied sei dem Herrn\" was performed at"],
["Fori Nehru founded an employment campaign in 1947 to sell"],
["1+1="]],
"dtype": "object"
},
{
"name": "INPUT_1",
"data": [[24], [24], [24], [24], [24], [24], [24], [24]],
"dtype": "uint32"
},
{
"name": "beam_search_diversity_rate",
"data": [[0], [0], [0], [0], [0], [0], [0], [0]],
"dtype": "float32"
},
{
"name": "temperature",
"data": [[1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0]],
"dtype": "float32"
},
{
"name": "len_penalty",
"data": [[1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0]],
"dtype": "float32"
},
{
"name": "repetition_penalty",
"data": [[1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0]],
"dtype": "float32"
},
{
"name": "random_seed",
"data": [[0], [0], [0], [0], [0], [0], [0], [0]],
"dtype": "uint64"
},
{
"name": "is_return_log_probs",
"data": [[true], [true], [true], [true], [true], [true], [true], [true]],
"dtype": "bool"
},
{
"name": "beam_width",
"data": [[1], [1], [1], [1], [1], [1], [1], [1]],
"dtype": "uint32"
},
{
"name": "runtime_top_k",
"data": [[1], [1], [1], [1], [1], [1], [1], [1]],
"dtype": "int32"
},
{
"name": "runtime_top_p",
"data": [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]],
"dtype": "float32"
},
{
"name": "start_id",
"data": [[1], [1], [1], [1], [1], [1], [1], [1]],
"dtype": "uint32"
},
{
"name": "end_id",
"data": [[2], [2], [2], [2], [2], [2], [2], [2]],
"dtype": "uint32"
},
{
"name": "INPUT_2",
"data": [[""],
[""],
[""],
[""],
[""],
[""],
[""],
[""]],
"dtype": "object"
},
{
"name": "INPUT_3",
"data": [[""],
[""],
[""],
[""],
[""],
[""],
[""],
[""]],
"dtype": "object"
}
]
}
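A minimal sketch of the consistency check mentioned above, assuming the request file name used earlier:
# Verify that every field in the request uses the same batch size
import json
request = json.load(open("sample_request_ensemble.json"))["request"]
sizes = {field["name"]: len(field["data"]) for field in request}
assert len(set(sizes.values())) == 1, f"batch sizes differ: {sizes}"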
Field | Type | Required | Default | Description | Constraints |
---|---|---|---|---|---|
beam_search_diversity_rate | float | Yes | 1.0 | Beam-search diversity rate | |
temperature | float | Yes | 1.0 | Controls how "soft" or "sharp" the output probability distribution is | |
len_penalty | float | Yes | 1.0 | Balances the length of the generated text against its quality | |
is_return_log_probs | bool | Yes | false | Whether to return log probs | |
beam_width | int | Yes | 1 | Beam width | |
runtime_top_k | int | Yes | 1 | top_k sampling parameter | The dtype of runtime_top_k changed from uint32 to int32 |
runtime_top_p | float | Yes | 1.0 | top_p sampling parameter | |
start_id | int | Yes | 0 | The tokenizer's start_id | |
end_id | int | Yes | 0 | The tokenizer's end_id | |
INPUT_0 | string | Yes | empty | Input string (context) | |
INPUT_1 | int | Yes | 0 | Maximum length of the output ids | |
output_seq_len | uint32_t | Yes | 0 | Maximum number of tokens you want in the result; note that it includes the input length | |
INPUT_3 | string | Yes | empty | Optional. Generation stops when a word in this list is generated. An extension of the stop id, i.e. the stop words list | |
repetition_penalty | float | Yes | 1.0 | Optional. Applies a repetition penalty to the logits for beam search and sampling. Mutually exclusive with presence_penalty | |
random_seed | unsigned long long int | Yes | 0 | Optional. Random seed used to initialize the random table for sampling | |
OUTPUT_0 | string | | | Output string | |
sequence_length | int | | | Length of the output ids | |
cum_log_probs | float | | | Cumulative log probability of the generated sentence | |
output_log_probs | float | | | Records the log probability of the sampled logits at each step | |
Appendix 3: Extension Plugin
Sample code is shown below. You can modify the function implementations to meet custom needs, such as custom model preprocess and postprocess flows. Note: the extension files must be named "vllm_extension.py" and "template.json".
# !/usr/bin/env python3
from transformers import AutoTokenizer
from typing import List
import json
class VllmExtension:
def __init__(self):  # Must take no constructor arguments, because the instance is created automatically without any; the config can instead be read from the current directory
import pathlib
folder_path = pathlib.Path(__file__).parent.resolve()
with open(f'{folder_path}/template.json', encoding='utf-8') as f:
self.template = json.load(f)
def get_tokenizer(self, tokenizer_path: str) -> AutoTokenizer:
return AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True, use_fast=False)
def text_to_tokens(self, tokenizer: AutoTokenizer, text: str) -> List[int]:
text = self.template['prompt'].replace('#CONTENT#', text)
prompt_tokens = tokenizer.encode(text)
return prompt_tokens
def tokens_to_text(self, tokenizer: AutoTokenizer, output_tokens: List[int]) -> str:
return tokenizer.decode(output_tokens, skip_special_tokens=False, clean_up_tokenization_spaces=False)
{
"prompt": "[Round 1]\n\n问:#CONTENT#\n\n答:"
}
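A minimal local sketch of how the plugin hooks fit together (the model path is a placeholder; in production the inference service constructs the instance and drives these calls):
# Illustrative local exercise of the extension hooks
from vllm_extension import VllmExtension
ext = VllmExtension()  # loads template.json from the same directory
tokenizer = ext.get_tokenizer("/mnt/model")  # placeholder model path
tokens = ext.text_to_tokens(tokenizer, "hello")  # wraps the text with the prompt template, then encodes
print(ext.tokens_to_text(tokenizer, tokens))  # decodes the token ids back to text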
Appendix 4: SequenceClassification Usage Example
Input format: [["input 1"], ["input 2"], ["input 3"]]
Output format: [[[output 1] [output 2] [output 3]]]
Notes:
When using the SequenceClassification feature, make sure config.json contains a "num_labels" field and an "architectures" field of the form XXForSequenceClassification (supported: llama: LlamaForSequenceClassification; qwen: QWenForSequenceClassification).
For example, with num_labels: 10 the output is a vector of 10 scores, e.g. output = [0.1, 0.2, ..., 1.0] with len(output) = 10.
Example config.json (shown as a screenshot in the original page):
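An illustrative fragment with the two required fields (the values are hypothetical):
{
"architectures": ["LlamaForSequenceClassification"],
"num_labels": 10
}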
Example request script: run python3 issue_request.py request_vllm.json
import google.protobuf.json_format
import json
import multiprocessing as mp
import numpy as np
import time
import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient
from argparse import ArgumentParser
from collections.abc import Mapping
from functools import partial
from tritonclient.grpc.service_pb2 import ModelInferResponse
from tritonclient.utils import np_to_triton_dtype
def deep_update(source, overrides):
"""
Update a nested dictionary or similar mapping.
Modify ``source`` in place.
"""
for key, value in overrides.items():
if isinstance(value, Mapping) and value:
returned = deep_update(source.get(key, {}), value)
source[key] = returned
else:
source[key] = overrides[key]
return source
def parse_args():
parser = ArgumentParser()
parser.add_argument("request_file", nargs="?", default=None, metavar="request-file")
parser.add_argument("--params")
parser.add_argument("-t", "--test", action="store_true")
args = parser.parse_args()
return args
def generate_parameters(args):
DEFAULT_CONFIG = {
'protocol': 'http',
'url': None,
'model_name': 'fastertransformer',
'verbose': False,
'stream_api': False,
}
params = {'config': DEFAULT_CONFIG, 'request': []}
if args.request_file is not None:
with open(args.request_file) as f:
file_params = json.load(f)
deep_update(params, file_params)
args_params = json.loads(args.params) if args.params else {}
deep_update(params, args_params)
for index, value in enumerate(params['request']):
params['request'][index] = {
'name': value['name'],
'data': np.array(value['data'], dtype=value['dtype']),
}
if params['config']['url'] is None:
if params['config']['protocol'] == 'grpc':
params['config']['url'] = 'localhost:8001'
else:
params['config']['url'] = 'localhost:8000'
return params['config'], params['request']
def prepare_tensor(client, name, input):
t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
t.set_data_from_numpy(input)
return t
def stream_consumer(queue, test: bool):
start_time = None
tokens_before = np.array([], dtype=np.int32)
while True:
result = queue.get()
if result is None:
break
if isinstance(result, float):
start_time = result
continue
message = ModelInferResponse()
google.protobuf.json_format.Parse(json.dumps(result), message)
result = grpcclient.InferResult(message)
#idx = result.as_numpy("sequence_length")[0, 0]
#tokens = result.as_numpy("OUTPUT_0")[:idx][0].decode('UTF-8')
logits = result.as_numpy("OUTPUT_0")
print("[After {:.2f}s] Partial result:\n{}\n".format(
time.perf_counter() - start_time, logits))
#if test:
# assert len(tokens) == len(tokens_before) + 1
# assert np.array_equal(tokens[:-1], tokens_before)
# tokens_before = tokens
def stream_callback(queue, result, error):
if error:
print(error)
queue.put(error)
else:
queue.put(result.get_response(as_json=True))
def main_stream(config, request, test: bool):
client_type = grpcclient
kwargs = {"verbose": config["verbose"]}
result_queue = mp.Queue()
consumer = mp.Process(target=stream_consumer, args=(result_queue, test))
consumer.start()
with grpcclient.InferenceServerClient(config['url'], verbose=config["verbose"]) as cl:
payload = [prepare_tensor(grpcclient, field['name'], field['data'])
for field in request]
cl.start_stream(callback=partial(stream_callback, result_queue))
result_queue.put(time.perf_counter())
cl.async_stream_infer(config['model_name'], payload)
result_queue.put(None)
consumer.join()
def main_sync(config, request):
is_http = config['protocol'] == 'http'
client_type = httpclient if is_http else grpcclient
kwargs = {"verbose": config["verbose"]}
if is_http:
kwargs["concurrency"] = 10
with client_type.InferenceServerClient(config['url'], **kwargs) as cl:
payload = [prepare_tensor(client_type, field['name'], field['data'])
for field in request]
result = cl.infer(config['model_name'], payload)
if is_http:
for output in result.get_response()['outputs']:
print("{}:\n{}\n".format(output['name'], result.as_numpy(output['name'])))
else:
for output in result.get_response().outputs:
print("{}:\n{}\n".format(output.name, result.as_numpy(output.name)))
if __name__ == "__main__":
args = parse_args()
config, request = generate_parameters(args)
if not config['stream_api']:
main_sync(config, request)
else:
main_stream(config, request, args.test)
{
"config": {
"model_name": "vllm",
"protocol": "grpc",
"stream_api": true
},
"request": [
{
"name": "INPUT_0",
"data": [
[
"Is Lili a female name"
],
[
"Malika Louback believes her three engineering degrees make her"
],
[
"Is Lili a female name"
]
],
"dtype": "object"
}
]
}
Performance Results
Category | Model | GPUs | vLLM v0.3.2 (fp16) | AIAK 1.3.1.6 (fp16) | AIAK 1.3.1.6 (fp16) vs. vLLM v0.3.2 (fp16) |
---|---|---|---|---|---|
Base LLM | Llama3 8B | 1 | 2529.13 | 2838.86 | 12% |
Base LLM | Llama3 70B | 8 | 1247.76 | 1408.06 | 13% |
Long sequence | Qwen1.5-72B-Chat | 8 | 134.8979 | 137.6582 | 2% |
MoE | Mixtral-8x22B-v0.1 | 8 | 1347.9856 | 1554.8096 | 15% |
Changelog
v1.3.3
- Added
Qwen1.5 supports the Medusa speculative-sampling inference mode; in small-batch scenarios, average performance improves 1.5x over the open-source baseline. Output can now be returned in non-streaming mode. The maximum number of tokens supported by model inference can now be configured via three new parameters: max_num_seqs, max_num_batched_tokens, and max_model_len.
- Fixed
Fixed a Triton hang issue found during autoscaling.
v1.3.2
- New model support
Support for Qwen1.5 0.5B/1.8B/4B/7B/14B/72B, InternLM2-20B, and Mixtral-8x7B, among other models.
- Quantization tool
Added FP8 KV cache, improving average throughput by more than 25%.
- Multi-chip support
Adapted to the Ascend 910B chip; with inference acceleration, peak throughput reaches 0.7x that of the A800.
- Performance testing tool
Provides a companion inference performance testing tool, performance-tool, covering peak-throughput and first-token-latency test scenarios.
- Performance improvements
- Runtime and request-scheduling optimizations improve throughput by more than 10%.
- Llama 1/2 support the Medusa speculative-sampling inference mode; in low-latency scenarios, average performance improves 1.5x over the open-source baseline.