Accelerating Inference Workloads with AIAK-Inference

Last updated: 2024-06-18

Prerequisites

Select the AIAK-Inference inference acceleration image in CCR as the base image.

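Procedure

TensorFlow model optimization

In the "Baidu AI Cloud AI Images" section of the CCR public images, select the tag aiak-inference:ubuntu18.04-cu11.2-tf2.4.1-py3.6-aiak1.1-latest (or pull the image with docker pull registry.baidubce.com/ai-public/aiak-inference:ubuntu18.04-cu11.2-tf2.4.1-py3.6-aiak1.1-latest). The image comes with CUDA, TensorFlow 2.4.1, and other base libraries preinstalled.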

Prepare the ResNet50 model inside the container:

import os
import tarfile
import urllib.request

import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()


def _wget_demo_tgz():
    # Download a publicly available ResNet50 model.
    url = 'http://url/to/your/model/YOUR_MODEL.tar.gz'
    local_tgz = os.path.basename(url)
    local_dir = local_tgz.split('.')[0]
    if not os.path.exists(local_dir):
        # Fetch and unpack the archive with the standard library.
        urllib.request.urlretrieve(url, local_tgz)
        with tarfile.open(local_tgz) as tar:
            tar.extractall()
    model_path = os.path.abspath(os.path.join(local_dir, "frozen_inference_graph.pb"))
    graph_def = tf.GraphDef()
    with open(model_path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    # Use random numbers as test data.
    test_data = np.random.rand(1, 800, 1000, 3)
    return graph_def, {'image_tensor:0': test_data}


graph_def, test_data = _wget_demo_tgz()

input_nodes = ['image_tensor']
output_nodes = ['detection_boxes', 'detection_scores', 'detection_classes', 'num_detections', 'detection_masks']

Then run inference on the model:

import time


def benchmark(model):
    tf.reset_default_graph()
    with tf.Session() as sess:
        tf.import_graph_def(model, name="")
        # Fetch the model outputs so that each run executes the whole graph.
        fetches = [name + ':0' for name in output_nodes]
        # Warmup!
        for i in range(0, 1000):
            sess.run(fetches, test_data)
        # Benchmark!
        num_runs = 1000
        start = time.time()
        for i in range(0, num_runs):
            sess.run(fetches, test_data)
        elapsed = time.time() - start
        rt_ms = elapsed / num_runs * 1000.0
        # Show the result!
        print("Latency of model: {:.2f} ms.".format(rt_ms))


# original graph
print("before compile:")
benchmark(graph_def)
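The benchmark above reports a single mean latency, which can hide occasional slow runs. As a rough sketch of a complementary measurement, the helper below times each call separately and reports the mean, median, and 99th percentile. The name measure_latency and its parameters are hypothetical, and the workload here is a stand-in; in the container, run_once could be a closure around the sess.run call.

```python
import statistics
import time


def measure_latency(run_once, warmup=10, runs=100):
    """Time run_once() repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        run_once()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    # Nearest-rank style index for the 99th percentile.
    p99_index = min(len(samples_ms) - 1, int(round(0.99 * (len(samples_ms) - 1))))
    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p99_ms": samples_ms[p99_index],
    }


# Stand-in workload for illustration; replace with a real inference call.
stats = measure_latency(lambda: sum(range(1000)), warmup=5, runs=50)
print({key: round(value, 4) for key, value in stats.items()})
```

Per-call timing adds a small overhead of its own, so it suits models whose latency is well above a few microseconds.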

Next, apply the AIAK-Inference optimization:

import aiak_inference

optimized_model = aiak_inference.optimize(
    graph_def,
    'gpu',
    outputs=['detection_boxes', 'detection_scores', 'detection_classes', 'num_detections', 'detection_masks']
)

The optimized model is still a GraphDef model, so the same code can be used to run inference on it:

# optimized graph
print("after compile:")
benchmark(optimized_model)

Comparing the two latencies shows that performance improves after optimization.
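To put a number on the improvement, the two printed latencies can be turned into a speedup factor and a percentage reduction. The sketch below shows the arithmetic; the function name summarize_speedup is hypothetical, and the input values are placeholders for illustration, not measured results.

```python
def summarize_speedup(before_ms, after_ms):
    """Return the speedup factor and the relative latency reduction in percent."""
    if before_ms <= 0 or after_ms <= 0:
        raise ValueError("latencies must be positive")
    return {
        "speedup": before_ms / after_ms,
        "latency_reduction_pct": (before_ms - after_ms) / before_ms * 100.0,
    }


# Placeholder numbers for illustration only, not measured results.
summary = summarize_speedup(before_ms=20.0, after_ms=10.0)
print("speedup: {speedup:.2f}x, latency reduced by {latency_reduction_pct:.1f}%".format(**summary))
```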

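PyTorch model optimization

In the "AI Acceleration Images" section of the CCR public images, select the acceleration image aiak-inference:cuda11.2_cudnn8_trt8.4_torch1.11-aiak_1.1.1_latest (or pull the image with docker pull registry.baidubce.com/ai-public/aiak-inference:cuda11.2_cudnn8_trt8.4_torch1.11-aiak_1.1.1_latest). The image comes with CUDA, PyTorch 1.11, and other base libraries preinstalled.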

Prepare a PyTorch model inside the container, using ResNet50 as an example:

import os
import time
import torch
import torchvision.models as models

model = models.resnet50().float().cuda()
model = torch.jit.script(model).eval()     # convert to a static graph with TorchScript
dummy = torch.rand(1, 3, 224, 224).cuda()

Run inference:

@torch.no_grad()
def benchmark(model, inp):
    # Warm up.
    for i in range(100):
        model(inp)
    # CUDA kernels run asynchronously, so synchronize before reading the clock.
    torch.cuda.synchronize()
    start = time.time()
    for i in range(200):
        model(inp)
    torch.cuda.synchronize()
    elapsed_ms = (time.time() - start) * 1000
    print("Latency: {:.2f} ms".format(elapsed_ms / 200))

# benchmark before optimization
print("before optimization:")
benchmark(model, dummy)

Next, optimize the model with AIAK-Inference to obtain the optimized model:

import aiak_inference

optimized_model = aiak_inference.optimize(
    model,
    'gpu',
    test_data=[dummy],
)

Run inference again:

# benchmark after optimization
print("after optimization:")
benchmark(optimized_model, dummy)

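Comparing the two results shows that per-inference latency drops substantially, which demonstrates the acceleration capability of AIAK-Inference.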
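Besides latency, it is usually worth confirming that the optimized model still produces outputs close to the original ones. The following is a minimal sketch of such a check on flat sequences of floats. The function name compare_outputs and the tolerance values are assumptions chosen for illustration; suitable tolerances depend on the model and on the precision the optimizer uses.

```python
import math


def compare_outputs(reference, candidate, rel_tol=1e-3, abs_tol=1e-3):
    """Compare two flat float sequences; return (all_close, max_abs_diff)."""
    ref = list(reference)
    cand = list(candidate)
    if len(ref) != len(cand):
        raise ValueError("outputs have different lengths")
    max_diff = max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)
    all_close = all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(ref, cand)
    )
    return all_close, max_diff


# Stand-in values for illustration only.
ok, diff = compare_outputs([0.10, 0.25, 0.65], [0.1001, 0.2499, 0.6500])
print(ok, diff)
```

In the container, the inputs could be obtained with something like model(dummy).flatten().tolist() and optimized_model(dummy).flatten().tolist(), assuming the optimized model returns a tensor of the same shape as the original.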
