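How to Make an Image Support RDMA
Updated: 2025-12-23
Most mainstream training container images today are built on Ubuntu. This article describes how to install the RDMA packages and check RDMA support in an Ubuntu environment.
Installing RDMA packages in a custom image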
- Run the following command to install the diagnostic package.
    apt update && apt install -y infiniband-diags
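- Use the ibstatus command to check the NIC rate. This example was run on an A800 instance. The output shows that the NIC (mlx5_1) reports a rate of 100 Gb/sec, which is the expected value.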
    # ibstatus
    Infiniband device 'mlx5_0' port 1 status:
        default gid: 0000:0000:0000:0000:0000:0000:0000:0000
        base lid: 0x0
        sm lid: 0x0
        state: 4: ACTIVE
        phys state: 5: LinkUp
        rate: 100 Gb/sec (4X EDR)
        link_layer: Ethernet

    Infiniband device 'mlx5_1' port 1 status:
        default gid: 0000:0000:0000:0000:0000:0000:0000:0000
        base lid: 0x0
        sm lid: 0x0
        state: 4: ACTIVE
        phys state: 5: LinkUp
        rate: 100 Gb/sec (4X EDR)
        link_layer: Ethernet

    Infiniband device 'mlx5_2' port 1 status:
        default gid: 0000:0000:0000:0000:0000:0000:0000:0000
        base lid: 0x0
        sm lid: 0x0
        state: 4: ACTIVE
        phys state: 5: LinkUp
        rate: 100 Gb/sec (4X EDR)
        link_layer: Ethernet
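Reading the rate by eye gets tedious when an instance has many ports. The following is a minimal sketch of how the check could be scripted. The helper names summarize_ibstatus and check_ibstatus are hypothetical and not part of infiniband-diags. The parser assumes the field layout shown in the sample output, in particular that link_layer is the last field printed for each port.

```shell
#!/bin/sh
# Hypothetical helpers (not part of infiniband-diags).
# Both read `ibstatus` text on stdin and assume the layout shown in the sample output.

# Print one line per port: <device> <state> <rate value> <rate unit> <link layer>
summarize_ibstatus() {
    awk '
        /^Infiniband device/ {
            # The device name is the text between the single quotes (\047).
            split($0, q, "\047")
            dev = q[2]
        }
        $1 == "state:"      { state = $NF }          # "state: 4: ACTIVE" -> ACTIVE
        $1 == "rate:"       { rate = $2 " " $3 }     # "rate: 100 Gb/sec (4X EDR)" -> 100 Gb/sec
        $1 == "link_layer:" { print dev, state, rate, $2 }
    '
}

# Return non-zero if no port was found, or if any port is not ACTIVE
# or does not report the expected rate (first argument, e.g. "100 Gb/sec").
check_ibstatus() {
    expected_rate=$1
    summarize_ibstatus | awk -v want="$expected_rate" '
        { seen = 1; got = $3 " " $4 }
        $2 != "ACTIVE" || got != want { print "FAIL: " $0; bad = 1 }
        END { exit (bad || !seen) }
    '
}
```

On an instance like the one in this example, the check could be run as ibstatus | check_ibstatus "100 Gb/sec". The expected rate depends on the instance type, so pass the value that applies to your hardware.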
- Run the following command to check whether the RDMA-related libraries are installed.
    dpkg -l perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1
Sample output
    # dpkg -l perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1
    dpkg-query: no packages found matching perftest
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name                    Version      Architecture Description
    +++-=======================-============-============-===========================================================
    ii  ibverbs-providers:amd64 39.0-1       amd64        User space provider drivers for libibverbs
    ii  libibumad3:amd64        39.0-1       amd64        InfiniBand Userspace Management Datagram (uMAD) library
    ii  libibverbs1:amd64       39.0-1       amd64        Library for direct userspace use of RDMA (InfiniBand/iWARP)
    ii  libnl-3-200:amd64       3.5.0-0.1    amd64        library for dealing with netlink sockets
    ii  libnl-route-3-200:amd64 3.5.0-0.1    amd64        library for dealing with netlink sockets - route interface
    ii  librdmacm1:amd64        39.0-1       amd64        Library for managing RDMA connections
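The output above lists both the packages that are installed (such as ibverbs-providers:amd64 and libibumad3:amd64) and the package that is not installed (perftest). If any package is missing, continue with the next step (step 4) to install it. If all packages are already installed, you can go straight to verifying RDMA support.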
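The same installed-versus-missing reading can be done mechanically. The sketch below is an illustration only. The helper name list_missing is hypothetical. It treats a package as installed only when dpkg -l reports it with the ii status, and it strips an architecture suffix such as :amd64 before comparing names.

```shell
#!/bin/sh
# Hypothetical helper: print every required package that `dpkg -l` does not
# report as installed ("ii"). Reads `dpkg -l` output on stdin; the required
# package names are passed as arguments.
list_missing() {
    awk -v required="$*" '
        $1 == "ii" {
            name = $2
            sub(/:.*/, "", name)      # drop an architecture suffix such as ":amd64"
            installed[name] = 1
        }
        END {
            n = split(required, want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in installed)) print want[i]
        }
    '
}
```

For example, with pkgs="perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1", running dpkg -l $pkgs 2>/dev/null | list_missing $pkgs prints one missing package name per line. Packages that dpkg does not know at all are reported on stderr and never appear as ii lines, so they are also listed as missing.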
- Run the following command to install the packages listed above.
    apt update && apt install -y perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1
- Run the following command to check the package installation status again.
    dpkg -l perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1
Sample output
    # dpkg -l perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name                    Version      Architecture Description
    +++-=======================-============-============-===========================================================
    ii  ibverbs-providers:amd64 39.0-1       amd64        User space provider drivers for libibverbs
    ii  libibumad3:amd64        39.0-1       amd64        InfiniBand Userspace Management Datagram (uMAD) library
    ii  libibverbs1:amd64       39.0-1       amd64        Library for direct userspace use of RDMA (InfiniBand/iWARP)
    ii  libnl-3-200:amd64       3.5.0-0.1    amd64        library for dealing with netlink sockets
    ii  libnl-route-3-200:amd64 3.5.0-0.1    amd64        library for dealing with netlink sockets - route interface
    ii  librdmacm1:amd64        39.0-1       amd64        Library for managing RDMA connections
    ii  perftest                4.4+0.37-1   amd64        Infiniband verbs performance tests
