Merge ~tai271828/+git/autotest-client-tests:mr-nv-performance-gpudirect-rdma into ~canonical-kernel-team/+git/autotest-client-tests:master

Proposed by Taihsiang Ho
Status: Merged
Merged at revision: 0bbc027ac76882d2d8c2dcbc3d36b6f45bfe651e
Proposed branch: ~tai271828/+git/autotest-client-tests:mr-nv-performance-gpudirect-rdma
Merge into: ~canonical-kernel-team/+git/autotest-client-tests:master
Diff against target: 337 lines (+295/-0)
7 files modified
ubuntu_performance_gpudirect_rdma/control (+13/-0)
ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/blanka (+7/-0)
ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/hot-koala (+7/-0)
ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/torchtusk (+7/-0)
ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/nvidia-peermem-test.sh (+189/-0)
ubuntu_performance_gpudirect_rdma/ubuntu_performance_gpudirect_rdma.py (+32/-0)
ubuntu_performance_gpudirect_rdma/ubuntu_performance_gpudirect_rdma.sh (+40/-0)
Reviewer: Po-Hsu Lin
Review status: Approve
Review via email: mp+446845@code.launchpad.net

Description of the change

This merge request adds a performance test for NVIDIA GPUDirect technology. For now there is only one test job: peer memory testing over InfiniBand. It simply checks that GPUDirect works and reports the measured performance.

The job has been tested on blanka running Jammy with linux-nvidia.

Po-Hsu Lin (cypressyew) wrote :

+1 with tested code.

review: Approve
Po-Hsu Lin (cypressyew) wrote :

Applied and pushed, thanks.

Preview Diff

diff --git a/ubuntu_performance_gpudirect_rdma/control b/ubuntu_performance_gpudirect_rdma/control
new file mode 100644
index 0000000..2e325f3
--- /dev/null
+++ b/ubuntu_performance_gpudirect_rdma/control
@@ -0,0 +1,13 @@
+AUTHOR = 'Taihsiang Ho <taihsiang.ho@canonical.com>'
+TIME = 'SHORT'
+NAME = 'NVIDIA GPUDirect performance test'
+TEST_TYPE = 'client'
+TEST_CLASS = 'kernel'
+TEST_CATEGORY = 'Benchmark'
+
+DOC = """
+Run the NVIDIA GPUDirect performance test. At the moment, it exercises InfiniBand Peer Memory
+technology.
+"""
+
+job.run_test_detail('ubuntu_performance_gpudirect_rdma', test_name='ib_peer_memory', tag='ib_peer_memory', timeout=1200)
diff --git a/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/blanka b/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/blanka
new file mode 100644
index 0000000..8c7c4a8
--- /dev/null
+++ b/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/blanka
@@ -0,0 +1,7 @@
+SERVER_IFACE=enp148s0
+SERVER_IP=192.168.5.1/24
+SERVER_IB_BDF=0000:4b:00.0
+
+CLIENT_IFACE=enp18s0
+CLIENT_IP=192.168.5.2/24
+CLIENT_IB_BDF=0000:ba:00.0
diff --git a/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/hot-koala b/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/hot-koala
new file mode 100644
index 0000000..a76218b
--- /dev/null
+++ b/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/hot-koala
@@ -0,0 +1,7 @@
+SERVER_IFACE=enp132s0
+SERVER_IP=192.168.5.1/24
+SERVER_IB_BDF=0000:84:00.0
+
+CLIENT_IFACE=ens1
+CLIENT_IP=192.168.5.2/24
+CLIENT_IB_BDF=0000:05:00.0
diff --git a/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/torchtusk b/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/torchtusk
new file mode 100644
index 0000000..f8f009d
--- /dev/null
+++ b/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/hosts.d/torchtusk
@@ -0,0 +1,7 @@
+SERVER_IFACE=eno33
+SERVER_IP=192.168.5.1/24
+SERVER_IB_BDF=0000:81:00.0
+
+CLIENT_IFACE=eno34
+CLIENT_IP=192.168.5.2/24
+CLIENT_IB_BDF=0000:81:00.1
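Each hosts.d file above is a flat list of KEY=VALUE lines that the test script sources. A minimal Python sketch of how such a file could be sanity-checked before a run is shown here. The helper names `parse_host_config` and `validate_host_config` are hypothetical and not part of this merge, and the checks (six required keys, lowercase `domain:bus:device.function` addresses, both IPs on one subnet) are assumptions drawn from the three files in this diff.

```python
import ipaddress
import re

# Keys that every hosts.d file in this diff defines.
REQUIRED_KEYS = (
    "SERVER_IFACE", "SERVER_IP", "SERVER_IB_BDF",
    "CLIENT_IFACE", "CLIENT_IP", "CLIENT_IB_BDF",
)
# PCI address in domain:bus:device.function form, e.g. 0000:4b:00.0
BDF_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def parse_host_config(text):
    """Parse KEY=VALUE lines (hypothetical helper, not part of the merge)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a KEY=VALUE line: {!r}".format(line))
        config[key] = value
    return config


def validate_host_config(config):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append("missing " + key)
    for key in ("SERVER_IB_BDF", "CLIENT_IB_BDF"):
        if key in config and not BDF_RE.match(config[key]):
            problems.append("{} is not a PCI BDF".format(key))
    try:
        server = ipaddress.ip_interface(config["SERVER_IP"])
        client = ipaddress.ip_interface(config["CLIENT_IP"])
    except (KeyError, ValueError):
        problems.append("SERVER_IP/CLIENT_IP must be address/prefix values")
    else:
        if server.network != client.network:
            problems.append("server and client are on different subnets")
        if server.ip == client.ip:
            problems.append("server and client share an address")
    return problems


# Contents of hosts.d/blanka as added by this merge.
BLANKA = """\
SERVER_IFACE=enp148s0
SERVER_IP=192.168.5.1/24
SERVER_IB_BDF=0000:4b:00.0

CLIENT_IFACE=enp18s0
CLIENT_IP=192.168.5.2/24
CLIENT_IB_BDF=0000:ba:00.0
"""

print(validate_host_config(parse_host_config(BLANKA)))  # prints []
```

The sketch only inspects the text of the file; it does not check that the interfaces or PCI devices exist on the machine.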
diff --git a/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/nvidia-peermem-test.sh b/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/nvidia-peermem-test.sh
new file mode 100755
index 0000000..c59c383
--- /dev/null
+++ b/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test/nvidia-peermem-test.sh
@@ -0,0 +1,189 @@
+#!/bin/bash
+#
+# This is a smoke test for the kernel IB PeerDirect feature, intended
+# for monitoring Ubuntu kernel updates for regressions. We
+# don't have a unit test for just that feature, so we instead do a
+# smoke test of Nvidia's GPUDirect feature, which uses IB PeerDirect
+# underneath. This requires using the Nvidia driver stack.
+#
+# To avoid orchestrating multiple machines, we instead place 2 IB
+# devices on the same machine in separate namespaces. Running the
+# client and server in separate namespaces ensures that the traffic
+# actually flows over the IB cable between the interfaces.
+# We use ib_write_bw from the perftest package to do essentially a
+# ping test. perftest from the archive is not configured to build against
+# the (non-free) CUDA stack, so we must first rebuild it. This rebuild
+# is done in a pbuilder chroot to avoid issues w/ build-dependencies
+# installing CUDA versions that don't match the nvidia driver.
+#
+# Prerequisites:
+# - nvidia-driver-<branch> package installed; nvidia driver loaded
+# - nvidia-fabricmanager, if required, installed and started
+# - 2 local IB ports connected back-to-back
+#
+# Author: dann frazier <dann.frazier@canonical.com>
+#
+set -e
+set -x
+
+export DEBIAN_FRONTEND="noninteractive"
+export DEBIAN_PRIORITY="critical"
+
+hostcfg="hosts.d/$HOSTNAME"
+if [ -e "$hostcfg" ]; then
+    source "$hostcfg"
+else
+    echo "ERROR: No configuration file found for $HOSTNAME" 1>&2
+    exit 1
+fi
+
+sudo_apt() {
+    sudo --preserve-env=DEBIAN_FRONTEND,DEBIAN_PRIORITY apt "$@"
+}
+
+cleanup() {
+    { [ -n "$srvpid" ] && test -d "/proc/$srvpid"; } && \
+        sudo kill "$srvpid" || /bin/true
+    [ -z "$tmpdir" ] || rm -rf "$tmpdir"
+    sudo ip addr del dev "$SERVER_IFACE" "$SERVER_IP" || /bin/true
+    sudo ip netns exec peermemclient \
+        ip addr del dev "$CLIENT_IFACE" "$CLIENT_IP" || /bin/true
+    sudo ip netns delete peermemclient || /bin/true
+}
+trap cleanup EXIT
+
+ubuntu_mirror() {
+    local arch
+    arch="$(dpkg --print-architecture)"
+    case $arch in
+        amd64|i386)
+            echo "http://archive.ubuntu.com/ubuntu"
+            return
+            ;;
+        *)
+            echo "http://ports.ubuntu.com/ubuntu-ports"
+            return
+            ;;
+    esac
+}
+
+install_cuda_perftest() {
+    local release
+    local components
+    if dpkg-query -W -f '${Version}' perftest | grep -q '+cuda\.1$'; then
+        # Looks like it is already built and installed
+        return
+    fi
+    release=$(lsb_release -cs)
+    components="main universe restricted multiverse"
+    # Rebuild perftest w/ CUDA support
+    sudo sed -i 's/# deb-src/deb-src/' /etc/apt/sources.list
+    sudo_apt update
+    sudo_apt build-dep -y perftest
+    sudo_apt install -y devscripts fakeroot pbuilder
+    tmpdir="$(mktemp -d)"
+    pushd "$tmpdir"
+    apt source perftest
+    pushd perftest-*
+    # There's a libnvidia-compute-<branch> package for every driver
+    # branch - each one provides a libcuda.1. dpkg-shlibdeps will
+    # generate a dependency for whichever package is installed at build-time.
+    # That will end up being whatever branch nvidia-cuda-dev was built for
+    # - and that may not match the driver version currently loaded. Using
+    # a mismatched libnvidia-compute/driver combo will cause ib_write_bw to
+    # error out (803 = cudaErrorSystemDriverMismatch). Override this
+    # dependency with the libnvidia-compute virtual package. We'll let
+    # apt figure out the best libnvidia-compute-<branch> package to
+    # install - it tends to pick the one that matches the installed driver.
+    echo "libcuda 1 libnvidia-compute" >> debian/shlibs.local
+    ver="$(dpkg-parsechangelog | grep ^Version: | cut -d' ' -f2)+cuda.1"
+    DEBFULLNAME="Canonical Kernel Team" \
+    DEBEMAIL="canonical-kernel-team@lists.canonical.com" \
+        dch -v "$ver" "Rebuild with CUDA support"
+    dpkg-buildpackage -rfakeroot -uc -us -S
+    popd
+    # We build in a pbuilder chroot instead of on the host because
+    # nvidia-cuda-dev's dependencies may pull in nvidia package versions
+    # from branches that mismatch with the host driver branch
+    if [ ! -f "/var/cache/pbuilder/${release}.tgz" ]; then
+        sudo pbuilder create --distribution "$release" \
+            --mirror "$(ubuntu_mirror)" \
+            --components "$components" \
+            --othermirror "deb $(ubuntu_mirror) ${release}-updates $components" \
+            --basetgz "/var/cache/pbuilder/${release}.tgz"
+    fi
+    mkdir result
+    sudo sed -i 's/^export CUDA_H_PATH=.*//' /etc/pbuilderrc
+    echo "export CUDA_H_PATH=/usr/include/cuda.h" | sudo tee -a /etc/pbuilderrc
+    sudo pbuilder build --basetgz "/var/cache/pbuilder/${release}.tgz" \
+        --extrapackages nvidia-cuda-dev \
+        --buildresult result perftest_*cuda.1.dsc
+    sudo dpkg -i result/perftest_*cuda.1_*.deb || sudo_apt -f install -y
+    popd
+}
+
+use_cuda_needs_devid() {
+    if ib_write_bw --help | grep use_cuda=; then
+        return 0
+    fi
+    return 1
+}
+
+# Avoid dpkg lock contention
+sudo service unattended-upgrades stop || true
+
+install_cuda_perftest
+
+for ibdev in /sys/class/infiniband/*; do
+    # is this lisp?
+    bdf="$(basename "$(dirname "$(dirname "$(readlink "$ibdev")")")")"
+    case "$bdf" in
+        "$CLIENT_IB_BDF")
+            client_ib_dev="$(basename "$ibdev")"
+            ;;
+        "$SERVER_IB_BDF")
+            server_ib_dev="$(basename "$ibdev")"
+            ;;
+    esac
+done
+
+if [ -z "$client_ib_dev" ]; then
+    echo "ERROR: Could not find client infiniband device" 1>&2
+    exit 1
+fi
+if [ -z "$server_ib_dev" ]; then
+    echo "ERROR: Could not find server infiniband device" 1>&2
+    exit 1
+fi
+
+sudo rdma system set netns exclusive
+sudo ip netns add peermemclient
+sudo rdma dev set "$client_ib_dev" netns peermemclient
+sudo ip netns exec peermemclient ip link set dev lo up
+sudo ip link set netns peermemclient "$CLIENT_IFACE"
+sudo ip netns exec peermemclient ip addr add dev "$CLIENT_IFACE" "$CLIENT_IP"
+sudo ip netns exec peermemclient ip link set dev "$CLIENT_IFACE" up
+
+sudo ip addr add dev "$SERVER_IFACE" "$SERVER_IP"
+sudo ip link set dev "$SERVER_IFACE" up
+
+sudo modprobe ib_umad # bro?
+sudo modprobe nvidia-peermem
+
+sudo_apt install -y opensm
+sudo service opensm start
+
+# Sometime after focal, ib_write_bw --use_cuda began requiring a device id
+if use_cuda_needs_devid; then
+    server_use_cuda_arg="--use_cuda=0"
+    client_use_cuda_arg="--use_cuda=1"
+else
+    server_use_cuda_arg="--use_cuda"
+    client_use_cuda_arg="--use_cuda"
+fi
+sudo ib_write_bw -a -d "$server_ib_dev" "$server_use_cuda_arg" &
+srvpid=$!
+# Give server a chance to start up
+sleep 5
+sudo ip netns exec peermemclient ib_write_bw -a \
+    -d "$client_ib_dev" "${SERVER_IP%/*}" "$client_use_cuda_arg"
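The nested basename/dirname call in the script above (the "is this lisp?" line) maps each entry under /sys/class/infiniband to the PCI address of the device behind it, which is then compared with the BDFs from hosts.d. A minimal Python sketch of the same lookup follows. The function names are hypothetical, the intermediate bridge address in the example path is made up for illustration, and the sketch assumes, as the script does, that each class entry is a symlink whose target ends in `<BDF>/infiniband/<device>`.

```python
import os


def bdf_from_link_target(target):
    """Return the PCI BDF component of an infiniband class symlink target.

    Mirrors basename(dirname(dirname(target))) from the shell script.
    """
    return os.path.basename(os.path.dirname(os.path.dirname(target)))


def find_ib_devices(server_bdf, client_bdf, sysfs_dir="/sys/class/infiniband"):
    """Map the configured BDFs to IB device names (hypothetical helper)."""
    found = {}
    if not os.path.isdir(sysfs_dir):
        # No RDMA devices (or no sysfs) on this machine.
        return found
    for name in sorted(os.listdir(sysfs_dir)):
        target = os.readlink(os.path.join(sysfs_dir, name))
        bdf = bdf_from_link_target(target)
        if bdf == server_bdf:
            found["server"] = name
        elif bdf == client_bdf:
            found["client"] = name
    return found


# Illustrative link target only; the bridge address 0000:48:01.0 is made up.
example = "../../devices/pci0000:48/0000:48:01.0/0000:4b:00.0/infiniband/mlx5_0"
print(bdf_from_link_target(example))  # prints 0000:4b:00.0
```

As in the shell version, a run would have to stop with an error if either the "server" or the "client" key is missing from the result.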
diff --git a/ubuntu_performance_gpudirect_rdma/ubuntu_performance_gpudirect_rdma.py b/ubuntu_performance_gpudirect_rdma/ubuntu_performance_gpudirect_rdma.py
new file mode 100644
index 0000000..a21a520
--- /dev/null
+++ b/ubuntu_performance_gpudirect_rdma/ubuntu_performance_gpudirect_rdma.py
@@ -0,0 +1,32 @@
+import os
+from autotest.client import test, utils
+
+p_dir = os.path.dirname(os.path.abspath(__file__))
+sh_executable = os.path.join(p_dir, "ubuntu_performance_gpudirect_rdma.sh")
+
+
+class ubuntu_performance_gpudirect_rdma(test.test):
+    version = 1
+
+    def initialize(self):
+        pass
+
+    def setup(self):
+        cmd = "{} setup".format(sh_executable)
+        utils.system(cmd)
+
+    def run_ib_peer_memory(self):
+        cmd = "{} test_ib_peer_memory".format(sh_executable)
+        utils.system(cmd)
+
+    def run_once(self, test_name):
+        if test_name == "ib_peer_memory":
+            self.run_ib_peer_memory()
+
+            print("")
+            print("{} has run.".format(test_name))
+
+        print("")
+
+    def postprocess_iteration(self):
+        pass
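In the class above, run_once recognises only "ib_peer_memory", and a name it does not recognise appears to fall through without raising an error. One possible variation, sketched here, is a lookup table that rejects unknown names. `PeerMemDispatcher` is a hypothetical stand-in that does not import autotest, and its run method only records the call instead of shelling out to the wrapper script.

```python
class PeerMemDispatcher:
    """Stand-in for the autotest test class; it does not import autotest."""

    def __init__(self):
        self.ran = []

    def run_ib_peer_memory(self):
        # The merged method runs the wrapper script; this sketch only
        # records that the handler was selected.
        self.ran.append("ib_peer_memory")

    def run_once(self, test_name):
        # Explicit table of supported test names and their handlers.
        handlers = {"ib_peer_memory": self.run_ib_peer_memory}
        if test_name not in handlers:
            raise ValueError("unknown test_name: {!r}".format(test_name))
        handlers[test_name]()
        print("{} has run.".format(test_name))


dispatcher = PeerMemDispatcher()
dispatcher.run_once("ib_peer_memory")
```

Whether an unknown name should fail the job is a policy choice for the test owners; the sketch only shows where such a check could sit.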
diff --git a/ubuntu_performance_gpudirect_rdma/ubuntu_performance_gpudirect_rdma.sh b/ubuntu_performance_gpudirect_rdma/ubuntu_performance_gpudirect_rdma.sh
new file mode 100755
index 0000000..d57d171
--- /dev/null
+++ b/ubuntu_performance_gpudirect_rdma/ubuntu_performance_gpudirect_rdma.sh
@@ -0,0 +1,40 @@
+#!/usr/bin/env bash
+#
+# Exercising the NVIDIA GPU Direct RDMA performance testing on Ubuntu
+#
+
+set -eo pipefail
+
+setup() {
+    # pre-setup testing environment and necessary tools
+    # currently a no-op; kept as a hook for future setup steps.
+    echo "begin to pre-setup testing"
+}
+
+run_test() {
+    exe_dir=$(dirname "${BASH_SOURCE[0]}")
+    pushd "${exe_dir}"/nvidia-peermem-test/
+    ./nvidia-peermem-test.sh
+    popd
+}
+
+case $1 in
+    setup)
+        echo ""
+        echo "[GPUDirect RDMA] On setting up necessary test environment..."
+        echo ""
+        setup
+        echo ""
+        echo "[GPUDirect RDMA] Set up necessary test environment."
+        echo ""
+        ;;
+    test_ib_peer_memory)
+        echo ""
+        echo "[GPUDirect RDMA] On running test_ib_peer_memory..."
+        echo ""
+        run_test
+        echo ""
+        echo "[GPUDirect RDMA] Run test_ib_peer_memory."
+        echo ""
+        ;;
+esac
