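AIAK Inference Acceleration Component

Last updated: 2023-07-21

Overview

AIAK is an acceleration engine for AI workloads. It optimizes models built on mainstream AI frameworks and can significantly improve the efficiency of developing and deploying AI tasks.

The AIAK inference acceleration suite is an inference optimization engine. It optimizes models produced by mainstream AI frameworks such as TensorFlow and PyTorch to reduce online inference latency, increase service throughput, and make much better use of heterogeneous resources. Combined with Baidu AI Cloud IaaS resources, it can further improve computing efficiency in AI scenarios.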

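Use Cases

AIAK inference acceleration supports, but is not limited to, models in the following scenarios:

Natural language processing, for example BERT and Transformer.
Image recognition, for example ResNet50 and MobileNetSSD.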

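Advantages

The AIAK inference acceleration component offers the following advantages.

Multi-framework compatibility: works with frameworks such as TensorFlow and PyTorch.
Broad model support: accelerates mainstream industry models.
Lightweight and convenient: only a small amount of code adaptation is needed to enable acceleration.

The figure below shows the inference latency gains of some typical models with AIAK on an NVIDIA Tesla T4 GPU. A higher value means lower latency.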

[Figure: inference latency gains of typical models with AIAK on an NVIDIA Tesla T4 GPU]
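The document does not state how the values in the figure are computed. A common convention, assumed here, is the speedup ratio: baseline latency divided by optimized latency, so that a higher value means lower latency after optimization. The function name and the sample numbers are illustrative only and are not measured AIAK results.

```python
def speedup(baseline_ms, optimized_ms):
    """Return baseline latency divided by optimized latency (assumed metric)."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


# Illustrative numbers only, not measured results.
print("speedup: {:.2f}x".format(speedup(12.0, 4.0)))
```

A ratio above 1 indicates that the optimized model is faster than the baseline; a ratio below 1 indicates a regression.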

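Configuration Steps

Environment Preparation

A GPU cloud server instance.
Deploying AIAK inference acceleration requires the following runtime environment.
- AI framework version: PyTorch 1.8 or later, or TensorFlow 1.15 or later.
- GPU runtime: CUDA 10.2 or later and TensorRT 7 or later.
- Python version: 3.6.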
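The framework minimums above can be checked with a short script before installing the acceleration package. The sketch below is not part of AIAK; `parse_version`, `meets_minimum`, and `check_environment` are illustrative names. It compares versions numerically, so that 1.15 correctly ranks above 1.8, and it covers only the Python packages, not CUDA or TensorRT.

```python
import re
import sys


def parse_version(text):
    """Parse '1.8.1+cu102' or '1.15' into a tuple of ints, e.g. (1, 8, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that '1.8' and '1.8.0' compare as equal.
    have = have + (0,) * (width - len(have))
    need = need + (0,) * (width - len(need))
    return have >= need


def check_environment():
    """Print the Python version and whether each framework meets its minimum."""
    print("Python", sys.version.split()[0])
    for module_name, minimum in (("torch", "1.8"), ("tensorflow", "1.15")):
        try:
            module = __import__(module_name)
        except ImportError:
            print(module_name, "not installed")
            continue
        ok = meets_minimum(module.__version__, minimum)
        print(module_name, module.__version__, "OK" if ok else "below " + minimum)


# A string comparison would get this wrong; the numeric comparison does not.
print(meets_minimum("1.15.0", "1.8"))
```

Calling `check_environment()` on the GPU instance prints one line per framework. Only one of the two frameworks needs to be installed.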

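Usage

AIAK inference acceleration can be used with multiple products. This document uses ResNet50 acceleration as an example to show how to use the component on a GPU cloud server. To use it together with Baidu AI Cloud Container Engine, see the cloud-native AI documentation.

TensorFlow

Log in to your Baidu AI Cloud GPU instance.
Submit a ticket to obtain the download link for the latest acceleration package.
Prepare the model your workload needs. ResNet50 is used here as an example.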

import os
import numpy as np
import tensorflow.compat.v1 as tf
import luno  # needed for the luno.util helpers used below

# tf is already the compat.v1 module, so call the function on it directly.
tf.disable_eager_execution()

def _wget_demo_tgz():
    # Download a publicly available resnet50 model as an example.
    url = 'https://cce-ai-native-package-bj.bj.bcebos.com/aiak-inference/examples/models/resnet50.pb'
    local_tgz = os.path.basename(url)
    local_dir = local_tgz.split('.')[0]
    if not os.path.exists(local_dir):
        luno.util.wget(url, local_tgz)
        luno.util.unpack(local_tgz)
    model_path = os.path.abspath(os.path.join(local_dir, "frozen_inference_graph.pb"))
    # Load the frozen graph from disk.
    graph_def = tf.GraphDef()
    with open(model_path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    # Random numbers serve as test data; replace them with your own dataset.
    test_data = np.random.rand(1, 800, 1000, 3)
    return graph_def, {'image_tensor:0': test_data}

graph_def, test_data = _wget_demo_tgz()

input_nodes = ['image_tensor']
output_nodes = ['detection_boxes', 'detection_scores', 'detection_classes', 'num_detections', 'detection_masks']

Run the model and measure the baseline inference latency.

import time

def benchmark(model):
    # Fetch the output tensors so that each run executes the whole graph.
    # Fetching only 'image_tensor:0' would return the fed input without running the model.
    fetches = [name + ':0' for name in output_nodes]
    tf.reset_default_graph()
    with tf.Session() as sess:
        tf.import_graph_def(model, name="")
        # Warm up.
        for i in range(0, 1000):
            sess.run(fetches, test_data)
        # Benchmark.
        num_runs = 1000
        start = time.time()
        for i in range(0, num_runs):
            sess.run(fetches, test_data)
        elapsed = time.time() - start
        rt_ms = elapsed / num_runs * 1000.0
        # Report the mean latency per run.
        print("Latency of model: {:.2f} ms.".format(rt_ms))

# original graph
print("=====Original Performance=====")
benchmark(graph_def)
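The same warm-up-then-measure pattern can be factored into a framework-independent helper. The sketch below is not part of AIAK; `measure_latency` and its parameters are illustrative names. It times any zero-argument callable with `time.perf_counter` and reports the mean and the 99th-percentile latency, since tail latency often matters more than the mean for online serving.

```python
import time


def measure_latency(run_once, warmup=10, runs=100):
    """Time a zero-argument callable and return (mean_ms, p99_ms)."""
    for _ in range(warmup):  # warm-up iterations are not timed
        run_once()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    mean_ms = sum(samples) / len(samples)
    # Nearest-rank percentile: rank = ceil(0.99 * n), computed with integers.
    rank = -(-99 * len(samples) // 100)
    return mean_ms, samples[max(0, rank - 1)]


# A CPU-only stand-in workload; replace the lambda with a real inference call.
mean_ms, p99_ms = measure_latency(lambda: sum(range(10000)))
print("mean: {:.3f} ms, p99: {:.3f} ms".format(mean_ms, p99_ms))
```

For GPU workloads the callable must block until the result is ready. `sess.run` does this when it returns its results, while PyTorch CUDA calls need an explicit `torch.cuda.synchronize()` inside the callable.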

Import the AIAK inference optimization component.

import luno
optimized_model = luno.optimize(
    graph_def,                 # Model to optimize: a tf.GraphDef here; the path to a SavedModel is also accepted.
    'o1',                      # Optimization level: o1 or o2.
    device_type='gpu',         # Target device: gpu or cpu.
    outputs=['detection_boxes', 'detection_scores', 'detection_classes', 'num_detections', 'detection_masks']
)

Run the optimized model and measure its inference latency.

# optimized graph
print("=====Optimized Performance=====")
benchmark(optimized_model)

The output shows the inference latency gain for the same model after applying AIAK.

[Screenshot: benchmark output for the TensorFlow example]

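PyTorch

Log in to your Baidu AI Cloud GPU instance.
Submit a ticket to obtain the download link for the latest acceleration package.
Prepare the model your workload needs. ResNet50 is used here as an example.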

import os
import time
import torch
import torchvision.models as models

model = models.resnet50().float().cuda()
model = torch.jit.script(model).eval()     # Convert to a static graph with TorchScript.
dummy = torch.rand(1, 3, 224, 224).cuda()

Run the model and measure the baseline inference latency.

@torch.no_grad()
def benchmark(model, inp):
    # Warm up.
    for i in range(100):
        model(inp)
    # CUDA kernels run asynchronously, so synchronize before reading the clock.
    torch.cuda.synchronize()
    start = time.time()
    for i in range(200):
        model(inp)
    torch.cuda.synchronize()
    elapsed_ms = (time.time() - start) * 1000
    print("Latency: {:.2f} ms".format(elapsed_ms / 200))

# benchmark before optimization
print("before optimization:")
benchmark(model, dummy)

Import the AIAK inference optimization component.

import luno

optimized_model = luno.optimize(
    model,
    'gpu',
    input_shapes=[[1, 3, 224, 224]],
    test_data=[dummy],
)

Run the optimized model and measure its inference latency.

# benchmark after optimization
print("after optimization:")
benchmark(optimized_model, dummy)

The output shows the inference latency gain for the same model after applying AIAK.

[Screenshot: benchmark output for the PyTorch example]

FAQ

Recommended GPU driver versions