Commit 1021260

Make verifiable a DSO and add NAME_SUFFIX support
Build option DSO=1 generates libverifiable.so, which can be used to reduce the combined binary size.
Build option NAME_SUFFIX can be used to add a suffix to all generated binaries, e.g. NAME_SUFFIX=_mpi.
Added new make target: clean_intermediates.
1 parent 501a149 commit 1021260

7 files changed: 156 additions and 88 deletions

README.md

Lines changed: 22 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -4,33 +4,43 @@ These tests check both the performance and the correctness of [NCCL](http://gith
44

55
## Build
66

7-
To build the tests, just type `make`.
7+
To build the tests, just type `make` or `make -j`
88

9-
If CUDA is not installed in /usr/local/cuda, you may specify CUDA\_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL\_HOME.
9+
If CUDA is not installed in `/usr/local/cuda`, you may specify `CUDA_HOME`. Similarly, if NCCL is not installed in `/usr`, you may specify `NCCL_HOME`.
1010

1111
```shell
1212
$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
1313
```
1414

15-
NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI\_HOME to the path where MPI is installed.
15+
NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set `MPI=1` and set `MPI_HOME` to the path where MPI is installed.
1616

1717
```shell
1818
$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
1919
```
2020

21+
You can also add a suffix to the name of the generated binaries with `NAME_SUFFIX`. For example when compiling with the MPI versions you could use:
22+
23+
```shell
24+
$ make MPI=1 NAME_SUFFIX=_mpi MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
25+
```
26+
27+
This will generate test binaries with names such as `all_reduce_perf_mpi`.
28+
2129
## Usage
2230

23-
NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of process is managed by MPI and is therefore not passed to the tests as argument. The total number of ranks (=CUDA devices) will be equal to (number of processes)\*(number of threads)\*(number of GPUs per thread).
31+
NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of process is managed by MPI and is therefore not passed to the tests as argument. The total number of ranks (=CUDA devices) will be equal to `(number of processes)*(number of threads)*(number of GPUs per thread)`.
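As a worked instance of the rank formula in the paragraph above, here is a minimal shell sketch. The helper name `total_ranks` is hypothetical and not part of nccl-tests; it only multiplies the three counts. The second call assumes one process and one thread for the single-node example.

```shell
#!/bin/sh
# Hypothetical helper (not part of nccl-tests): computes the total rank count
# as (number of processes) * (number of threads) * (number of GPUs per thread).
total_ranks() {
  procs=$1
  threads=$2
  gpus_per_thread=$3
  echo $(( procs * threads * gpus_per_thread ))
}

# 64 MPI processes, 1 thread each, 1 GPU per thread (the mpirun example).
total_ranks 64 1 1

# Assumed: 1 process and 1 thread driving 8 GPUs (the single-node -g 8 example).
total_ranks 1 1 8
```

Both layouts use one rank per CUDA device; only the way the devices are divided among processes and threads differs.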
2432

2533
### Quick examples
2634

2735
Run on single node with 8 GPUs (`-g 8`), scanning from 8 Bytes to 128MBytes :
36+
2837
```shell
2938
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
3039
```
3140

3241
Run 64 MPI processes on nodes with 8 GPUs each, for a total of 64 GPUs spread across 8 nodes :
3342
(NB: The nccl-tests binaries must be compiled with `MPI=1` for this case)
43+
3444
```shell
3545
$ mpirun -np 64 -N 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
3646
```
@@ -73,7 +83,7 @@ All tests support the same set of arguments :
7383

7484
### Running multiple operations in parallel
7585

76-
NCCL tests allow to partition the set of GPUs into smaller sets, each executing the same operation in parallel.
86+
NCCL tests allow to partition the set of GPUs into smaller sets, each executing the same operation in parallel.
7787
To split the GPUs, NCCL will compute a "color" for each rank, based on the `NCCL_TESTS_SPLIT` environment variable, then all ranks
7888
with the same color will end up in the same group. The resulting group is printed next to each GPU at the beginning of the test.
7989

@@ -82,13 +92,15 @@ with the same color will end up in the same group. The resulting group is printe
8292
`NCCL_TESTS_SPLIT_MASK="<value>"` is equivalent to `NCCL_TESTS_SPLIT="&<value>"`.
8393

8494
Here are a few examples:
85-
- `NCCL_TESTS_SPLIT="AND 0x7"` or `NCCL_TESTS_SPLIT="MOD 8`: On systems with 8 GPUs, run 8 parallel operations, each with 1 GPU per node (purely communicating on the network)
86-
- `NCCL_TESTS_SPLIT="OR 0x7"` or `NCCL_TESTS_SPLIT="DIV 8"`: On systems with 8 GPUs, run one operation per node, purely intra-node.
87-
- `NCCL_TESTS_SPLIT="AND 0x1"` or `NCCL_TESTS_SPLIT="MOD 2"`: Run two operations, each operation using every other rank.
95+
96+
- `NCCL_TESTS_SPLIT="AND 0x7"` or `NCCL_TESTS_SPLIT="MOD 8"`: On systems with 8 GPUs, run 8 parallel operations, each with 1 GPU per node (purely communicating over the inter-node network)
97+
98+
- `NCCL_TESTS_SPLIT="OR 0x7"` or `NCCL_TESTS_SPLIT="DIV 8"`: On systems with 8 GPUs, run one operation per node, purely intra-node.
99+
100+
- `NCCL_TESTS_SPLIT="AND 0x1"` or `NCCL_TESTS_SPLIT="MOD 2"`: Run two operations, each operation using every other rank.
88101

89102
Note that the reported bandwidth is per group, hence to get the total bandwidth used by all groups, one must multiply by the number of groups.
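The grouping in the examples above can be sketched in shell. This assumes the color is computed as `rank <op> value`, which is an assumption of this sketch and not a statement about the nccl-tests source; `split_color` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper (not the nccl-tests implementation): applies one of the
# split operations to a rank and prints the resulting color.
split_color() {
  op=$1
  value=$2
  rank=$3
  case "$op" in
    AND) echo $(( $rank & $value )) ;;
    OR)  echo $(( $rank | $value )) ;;
    MOD) echo $(( $rank % $value )) ;;
    DIV) echo $(( $rank / $value )) ;;
    *)   echo "unsupported op: $op" >&2; return 1 ;;
  esac
}

# Four sample ranks from a 16-rank job on two 8-GPU nodes.
# "AND 0x7" groups ranks by local GPU index; "DIV 8" groups them by node.
for rank in 0 1 8 9; do
  printf 'rank %2d  AND 0x7 -> %d  DIV 8 -> %d\n' \
    "$rank" "$(split_color AND 0x7 "$rank")" "$(split_color DIV 8 "$rank")"
done
```

Under `AND 0x7`, ranks 0 and 8 share a color, so each group spans nodes. Under `DIV 8`, ranks 0 and 1 share a color, so each group stays within one node.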
90103

91104
## Copyright
92105

93-
NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2024, NVIDIA CORPORATION. All rights reserved.
94-
106+
NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2025, NVIDIA CORPORATION. All rights reserved.

src/Makefile

Lines changed: 25 additions & 68 deletions
@@ -1,73 +1,13 @@
11
#
2-
# Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
2+
# Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
33
#
44
# See LICENSE.txt for license information
55
#
6+
include common.mk
67

7-
CUDA_HOME ?= /usr/local/cuda
8-
PREFIX ?= /usr/local
9-
VERBOSE ?= 0
10-
DEBUG ?= 0
11-
12-
CUDA_LIB ?= $(CUDA_HOME)/lib64
13-
CUDA_INC ?= $(CUDA_HOME)/include
14-
NVCC ?= $(CUDA_HOME)/bin/nvcc
15-
CUDARTLIB ?= cudart
16-
17-
CUDA_VERSION = $(strip $(shell which $(NVCC) >/dev/null && $(NVCC) --version | grep release | sed 's/.*release //' | sed 's/\,.*//'))
18-
CUDA_MAJOR = $(shell echo $(CUDA_VERSION) | cut -d "." -f 1)
19-
CUDA_MINOR = $(shell echo $(CUDA_VERSION) | cut -d "." -f 2)
20-
21-
# Better define NVCC_GENCODE in your environment to the minimal set
22-
# of archs to reduce compile time.
23-
ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8 -o "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
24-
# Include Blackwell support if we're using CUDA12.8 or above
25-
NVCC_GENCODE ?= -gencode=arch=compute_80,code=sm_80 \
26-
-gencode=arch=compute_90,code=sm_90 \
27-
-gencode=arch=compute_100,code=sm_100 \
28-
-gencode=arch=compute_120,code=sm_120 \
29-
-gencode=arch=compute_120,code=compute_120
30-
else ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 12; echo $$?),0)
31-
NVCC_GENCODE ?= -gencode=arch=compute_60,code=sm_60 \
32-
-gencode=arch=compute_61,code=sm_61 \
33-
-gencode=arch=compute_70,code=sm_70 \
34-
-gencode=arch=compute_80,code=sm_80 \
35-
-gencode=arch=compute_90,code=sm_90 \
36-
-gencode=arch=compute_90,code=compute_90
37-
else ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 11; echo $$?),0)
38-
NVCC_GENCODE ?= -gencode=arch=compute_60,code=sm_60 \
39-
-gencode=arch=compute_61,code=sm_61 \
40-
-gencode=arch=compute_70,code=sm_70 \
41-
-gencode=arch=compute_80,code=sm_80 \
42-
-gencode=arch=compute_80,code=compute_80
43-
else
44-
NVCC_GENCODE ?= -gencode=arch=compute_35,code=sm_35 \
45-
-gencode=arch=compute_50,code=sm_50 \
46-
-gencode=arch=compute_60,code=sm_60 \
47-
-gencode=arch=compute_61,code=sm_61 \
48-
-gencode=arch=compute_70,code=sm_70 \
49-
-gencode=arch=compute_70,code=compute_70
50-
endif
51-
52-
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11
53-
CXXFLAGS := -std=c++11
54-
55-
LDFLAGS := -L${CUDA_LIB} -lcudart -lrt
56-
NVLDFLAGS := -L${CUDA_LIB} -l${CUDARTLIB} -lrt
57-
58-
ifeq ($(DEBUG), 0)
59-
NVCUFLAGS += -O3 -g
60-
CXXFLAGS += -O3 -g
61-
else
62-
NVCUFLAGS += -O0 -G -g
63-
CXXFLAGS += -O0 -g -ggdb3
64-
endif
65-
66-
ifneq ($(VERBOSE), 0)
67-
NVCUFLAGS += -Xcompiler -Wall,-Wextra,-Wno-unused-parameter
68-
else
69-
.SILENT:
70-
endif
8+
MPI ?= 0 # Set to 1 to enable MPI support (multi-process/multi-node)
9+
NAME_SUFFIX ?= # e.g. _mpi when using MPI=1
10+
DSO ?= 0 # Set to 1 to create and use libverifiable.so to reduce binary size
7111

7212
.PHONY: build clean
7313

@@ -92,7 +32,7 @@ DST_DIR := $(BUILDDIR)
9232
SRC_FILES := $(wildcard *.cu)
9333
OBJ_FILES := $(SRC_FILES:%.cu=${DST_DIR}/%.o)
9434
BIN_FILES_LIST := all_reduce all_gather broadcast reduce_scatter reduce alltoall scatter gather sendrecv hypercube
95-
BIN_FILES := $(BIN_FILES_LIST:%=${DST_DIR}/%_perf)
35+
BIN_FILES := $(BIN_FILES_LIST:%=${DST_DIR}/%_perf${NAME_SUFFIX})
9636

9737
build: ${BIN_FILES}
9838

@@ -103,18 +43,35 @@ TEST_VERIFIABLE_SRCDIR := ../verifiable
10343
TEST_VERIFIABLE_BUILDDIR := $(BUILDDIR)/verifiable
10444
include ../verifiable/verifiable.mk
10545

46+
.PRECIOUS: ${DST_DIR}/%.o
47+
10648
${DST_DIR}/%.o: %.cu common.h $(TEST_VERIFIABLE_HDRS)
10749
@printf "Compiling %-35s > %s\n" $< $@
10850
@mkdir -p ${DST_DIR}
10951
$(NVCC) -o $@ $(NVCUFLAGS) -c $<
11052

53+
${DST_DIR}/%$(NAME_SUFFIX).o: %.cu common.h $(TEST_VERIFIABLE_HDRS)
54+
@printf "Compiling %-35s > %s\n" $< $@
55+
@mkdir -p ${DST_DIR}
56+
$(NVCC) -o $@ $(NVCUFLAGS) -c $<
57+
11158
${DST_DIR}/timer.o: timer.cc timer.h
11259
@printf "Compiling %-35s > %s\n" $< $@
11360
@mkdir -p ${DST_DIR}
114-
$(CXX) $(CXXFLAGS) -o $@ -c timer.cc
61+
$(CXX) $(CXXFLAGS) -o $@ -c $<
11562

116-
${DST_DIR}/%_perf:${DST_DIR}/%.o ${DST_DIR}/common.o ${DST_DIR}/timer.o $(TEST_VERIFIABLE_OBJS)
63+
ifeq ($(DSO), 1)
64+
${DST_DIR}/%_perf$(NAME_SUFFIX): ${DST_DIR}/%.o ${DST_DIR}/common$(NAME_SUFFIX).o ${DST_DIR}/timer.o $(TEST_VERIFIABLE_LIBS)
65+
@printf "Linking %-35s > %s\n" $< $@
66+
@mkdir -p ${DST_DIR}
67+
$(NVCC) -o $@ $(NVCUFLAGS) $^ -L$(TEST_VERIFIABLE_BUILDDIR) -lverifiable ${NVLDFLAGS} -Xlinker "--enable-new-dtags" -Xlinker "-rpath,\$$ORIGIN:\$$ORIGIN/verifiable"
68+
else
69+
${DST_DIR}/%_perf$(NAME_SUFFIX):${DST_DIR}/%.o ${DST_DIR}/common$(NAME_SUFFIX).o ${DST_DIR}/timer.o $(TEST_VERIFIABLE_OBJS)
11770
@printf "Linking %-35s > %s\n" $< $@
11871
@mkdir -p ${DST_DIR}
11972
$(NVCC) -o $@ $(NVCUFLAGS) $^ ${NVLDFLAGS}
73+
endif
74+
75+
clean_intermediates:
76+
rm -f ${DST_DIR}/*.o $(TEST_VERIFIABLE_OBJS)
12077

src/common.mk

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
1+
#
2+
# Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
3+
#
4+
# See LICENSE.txt for license information
5+
#
6+
CUDA_HOME ?= /usr/local/cuda
7+
PREFIX ?= /usr/local
8+
VERBOSE ?= 0
9+
DEBUG ?= 0
10+
11+
CUDA_LIB ?= $(CUDA_HOME)/lib64
12+
CUDA_INC ?= $(CUDA_HOME)/include
13+
NVCC ?= $(CUDA_HOME)/bin/nvcc
14+
CUDARTLIB ?= cudart
15+
16+
CUDA_VERSION = $(strip $(shell which $(NVCC) >/dev/null && $(NVCC) --version | grep release | sed 's/.*release //' | sed 's/\,.*//'))
17+
CUDA_MAJOR = $(shell echo $(CUDA_VERSION) | cut -d "." -f 1)
18+
CUDA_MINOR = $(shell echo $(CUDA_VERSION) | cut -d "." -f 2)
19+
20+
# Better define NVCC_GENCODE in your environment to the minimal set
21+
# of archs to reduce compile time.
22+
ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8 -o "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
23+
# Include Blackwell support if we're using CUDA12.8 or above
24+
NVCC_GENCODE ?= -gencode=arch=compute_80,code=sm_80 \
25+
-gencode=arch=compute_90,code=sm_90 \
26+
-gencode=arch=compute_100,code=sm_100 \
27+
-gencode=arch=compute_120,code=sm_120 \
28+
-gencode=arch=compute_120,code=compute_120
29+
else ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 12; echo $$?),0)
30+
NVCC_GENCODE ?= -gencode=arch=compute_60,code=sm_60 \
31+
-gencode=arch=compute_61,code=sm_61 \
32+
-gencode=arch=compute_70,code=sm_70 \
33+
-gencode=arch=compute_80,code=sm_80 \
34+
-gencode=arch=compute_90,code=sm_90 \
35+
-gencode=arch=compute_90,code=compute_90
36+
else ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 11; echo $$?),0)
37+
NVCC_GENCODE ?= -gencode=arch=compute_60,code=sm_60 \
38+
-gencode=arch=compute_61,code=sm_61 \
39+
-gencode=arch=compute_70,code=sm_70 \
40+
-gencode=arch=compute_80,code=sm_80 \
41+
-gencode=arch=compute_80,code=compute_80
42+
else
43+
NVCC_GENCODE ?= -gencode=arch=compute_35,code=sm_35 \
44+
-gencode=arch=compute_50,code=sm_50 \
45+
-gencode=arch=compute_60,code=sm_60 \
46+
-gencode=arch=compute_61,code=sm_61 \
47+
-gencode=arch=compute_70,code=sm_70 \
48+
-gencode=arch=compute_70,code=compute_70
49+
endif
50+
51+
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11
52+
CXXFLAGS := -std=c++11
53+
54+
LDFLAGS := -L${CUDA_LIB} -lcudart -lrt
55+
NVLDFLAGS := -L${CUDA_LIB} -l${CUDARTLIB} -lrt
56+
57+
ifeq ($(DEBUG), 0)
58+
NVCUFLAGS += -O3 -g
59+
CXXFLAGS += -O3 -g
60+
else
61+
NVCUFLAGS += -O0 -G -g
62+
CXXFLAGS += -O0 -g -ggdb3
63+
endif
64+
65+
ifneq ($(VERBOSE), 0)
66+
NVCUFLAGS += -Xcompiler -Wall,-Wextra,-Wno-unused-parameter
67+
else
68+
.SILENT:
69+
endif

verifiable/Makefile

Lines changed: 11 additions & 6 deletions
@@ -1,13 +1,18 @@
1-
include ../../makefiles/common.mk
1+
#
2+
# Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
3+
#
4+
# See LICENSE.txt for license information
5+
#
6+
include ../src/common.mk
27

38
.PHONY: all clean
49

5-
BUILDDIR := $(abspath ../../build)
10+
BUILDDIR := $(abspath ../build)
611
NCCLDIR := $(BUILDDIR)
712
NVCUFLAGS += -I$(NCCLDIR)/include/ -I../include
8-
DST_DIR := $(BUILDDIR)/test/verifiable
13+
DST_DIR := $(BUILDDIR)/verifiable
914

10-
all: $(DST_DIR)/self_test $(DST_DIR)/verifiable.o
15+
all: $(DST_DIR)/self_test
1116

1217
clean:
1318
rm -rf $(DST_DIR)
@@ -18,7 +23,7 @@ include verifiable.mk
1823

1924
self_test: $(DST_DIR)/self_test
2025

21-
$(DST_DIR)/self_test: verifiable.cu verifiable.h
26+
$(DST_DIR)/self_test: main.cu $(TEST_VERIFIABLE_LIBS)
2227
@printf "Linking %s\n" $@
2328
@mkdir -p $(DST_DIR)
24-
$(NVCC) -o $@ $(NVCUFLAGS) -DSELF_TEST=1 verifiable.cu $(NVLDFLAGS)
29+
$(NVCC) -o $@ $(NVCUFLAGS) $< -L$(TEST_VERIFIABLE_BUILDDIR) -lverifiable $(NVLDFLAGS) -Xlinker "-rpath=\$$ORIGIN"
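The `\$$ORIGIN` escaping in the link rule above passes through two layers: make first rewrites `$$` to `$`, then the shell sees `\$ORIGIN` inside double quotes. The sketch below shows only the shell layer; `rpath_arg` is a hypothetical helper used for illustration.

```shell
#!/bin/sh
# After make rewrites "$$" to "$", the recipe line handed to the shell contains
# "\$ORIGIN". Inside double quotes the backslash keeps the dollar sign literal,
# so the shell does not try to expand an ORIGIN variable.
rpath_arg() {
  printf '%s\n' "-rpath=\$ORIGIN"
}

# Prints the argument exactly as the linker would receive it.
rpath_arg
```

The literal `$ORIGIN` token is later resolved by the dynamic loader to the directory containing the executable, which lets the binary find `libverifiable.so` next to itself without setting `LD_LIBRARY_PATH`.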

verifiable/main.cu

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
2+
#include <cuda_runtime.h>
3+
#include <iostream>
4+
5+
#define NCCL_VERIFIABLE_SELF_TEST 1
6+
#include "verifiable.h"
7+
8+
int main(int arg_n, char **args) {
9+
std::cerr<<"You are hoping to see no output beyond this line."<<std::endl;
10+
cudaSetDevice(0);
11+
ncclVerifiableLaunchSelfTest();
12+
cudaDeviceSynchronize();
13+
return 0;
14+
}

verifiable/verifiable.h

Lines changed: 4 additions & 0 deletions
@@ -57,4 +57,8 @@ cudaError_t ncclVerifiableVerify(
5757
int64_t *bad_elt_n, cudaStream_t stream
5858
);
5959

60+
#ifdef NCCL_VERIFIABLE_SELF_TEST
61+
void ncclVerifiableLaunchSelfTest();
62+
#endif
63+
6064
#endif

verifiable/verifiable.mk

Lines changed: 11 additions & 4 deletions
@@ -1,11 +1,18 @@
1-
# We requires both of the following paths to be set upon including this makefile
1+
# We require both of the following paths to be set upon including this makefile
22
# TEST_VERIFIABLE_SRCDIR = <points to this directory>
3-
# TEST_VERIFIABLE_BUILDDIR = <points to destination of .o file>
3+
# TEST_VERIFIABLE_BUILDDIR = <points to destination of .so file>
44

55
TEST_VERIFIABLE_HDRS = $(TEST_VERIFIABLE_SRCDIR)/verifiable.h
66
TEST_VERIFIABLE_OBJS = $(TEST_VERIFIABLE_BUILDDIR)/verifiable.o
7+
TEST_VERIFIABLE_LIBS = $(TEST_VERIFIABLE_BUILDDIR)/libverifiable.so
78

8-
$(TEST_VERIFIABLE_BUILDDIR)/verifiable.o: $(TEST_VERIFIABLE_SRCDIR)/verifiable.cu $(TEST_VERIFY_REDUCE_HDRS)
9+
$(TEST_VERIFIABLE_BUILDDIR)/verifiable.o: $(TEST_VERIFIABLE_SRCDIR)/verifiable.cu $(TEST_VERIFIABLE_HDRS)
910
@printf "Compiling %s\n" $@
1011
@mkdir -p $(TEST_VERIFIABLE_BUILDDIR)
11-
$(NVCC) -o $@ $(NVCUFLAGS) -c $(TEST_VERIFIABLE_SRCDIR)/verifiable.cu
12+
$(NVCC) -Xcompiler "-fPIC" -o $@ $(NVCUFLAGS) -c $(TEST_VERIFIABLE_SRCDIR)/verifiable.cu
13+
14+
$(TEST_VERIFIABLE_BUILDDIR)/libverifiable.so: $(TEST_VERIFIABLE_OBJS)
15+
@printf "Creating DSO %s\n" $@
16+
@mkdir -p $(TEST_VERIFIABLE_BUILDDIR)
17+
$(CC) -shared -o $@.0 $^ -Wl,-soname,$(notdir $@).0
18+
ln -sf $(notdir $@).0 $@
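The `libverifiable.so` rule above writes the real object to `libverifiable.so.0` and then points an unversioned symlink at it. The following shell sketch reproduces only that file layout, with an empty placeholder instead of a real shared object; `make_dso_layout` is a hypothetical helper, and the temporary directory stands in for `TEST_VERIFIABLE_BUILDDIR`.

```shell
#!/bin/sh
# Reproduces only the file layout of the libverifiable.so rule, using an empty
# placeholder file so that no compiler is needed.
make_dso_layout() {
  build_dir=$1
  mkdir -p "$build_dir"
  # Stands in for the output of: $(CC) -shared -o $@.0 ...
  : > "$build_dir/libverifiable.so.0"
  # Same relative symlink as: ln -sf $(notdir $@).0 $@
  ln -sf libverifiable.so.0 "$build_dir/libverifiable.so"
}

demo_dir=$(mktemp -d)
make_dso_layout "$demo_dir"
# Show where the unversioned name points.
readlink "$demo_dir/libverifiable.so"
rm -rf "$demo_dir"
```

At link time `-lverifiable` resolves through the unversioned symlink, while the soname recorded in the binaries is the versioned name, so the loader looks for `libverifiable.so.0` at run time.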
