Build option `DSO=1` generates `libverifiable.so`, which can be
used to reduce the combined binary size.
Build option `NAME_SUFFIX` can be used to add a suffix to all
generated binaries, e.g. `NAME_SUFFIX=_mpi`.
Added new make target: `clean_intermediates`.
These tests check both the performance and the correctness of NCCL.

## Build

To build the tests, just type `make` or `make -j`.

If CUDA is not installed in `/usr/local/cuda`, you may specify `CUDA_HOME`. Similarly, if NCCL is not installed in `/usr`, you may specify `NCCL_HOME`.

```shell
$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
```
NCCL tests rely on MPI to work on multiple processes, and hence on multiple nodes. If you want to compile the tests with MPI support, you need to set `MPI=1` and set `MPI_HOME` to the path where MPI is installed.

```shell
$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
```
You can also add a suffix to the names of the generated binaries with `NAME_SUFFIX`. For example, when compiling the MPI versions you could use:

```shell
$ make MPI=1 NAME_SUFFIX=_mpi MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
```

This will generate test binaries with names such as `all_reduce_perf_mpi`.
## Usage

NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument. The total number of ranks (= CUDA devices) will be equal to `(number of processes)*(number of threads)*(number of GPUs per thread)`.
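As a sketch of that arithmetic (the process, thread, and GPU counts below are hypothetical, chosen only to illustrate the formula):

```python
# Hypothetical launch: 4 MPI processes, 2 threads per process (-t 2),
# and 2 GPUs per thread (-g 2).
nprocs, nthreads, ngpus = 4, 2, 2

# Total ranks = processes * threads * GPUs per thread
print(nprocs * nthreads * ngpus)  # prints 16
```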
### Quick examples

Run on a single node with 8 GPUs (`-g 8`), scanning from 8 bytes to 128 MB:

```shell
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```
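The `-b`/`-e`/`-f` flags describe a geometric sweep of message sizes; a minimal sketch of the sizes such a scan visits (an illustration of the flag semantics, not the tests' actual code):

```python
def scan_sizes(begin, end, factor):
    """Yield message sizes from begin to end, multiplying by factor,
    mirroring a -b/-e/-f geometric sweep."""
    size = begin
    while size <= end:
        yield size
        size *= factor

# -b 8 -e 128M -f 2: sizes 8, 16, 32, ... up to 128 MiB
sizes = list(scan_sizes(8, 128 * 1024 * 1024, 2))
print(len(sizes), sizes[0], sizes[-1])  # 25 8 134217728
```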
Run 64 MPI processes on nodes with 8 GPUs each, for a total of 64 GPUs spread across 8 nodes:
(NB: the nccl-tests binaries must be compiled with `MPI=1` for this case.)
All tests support the same set of arguments.

### Running multiple operations in parallel
NCCL tests allow partitioning the set of GPUs into smaller sets, each executing the same operation in parallel.
To split the GPUs, NCCL will compute a "color" for each rank, based on the `NCCL_TESTS_SPLIT` environment variable, then all ranks
with the same color will end up in the same group. The resulting group is printed next to each GPU at the beginning of the test.
`NCCL_TESTS_SPLIT_MASK="<value>"` is equivalent to `NCCL_TESTS_SPLIT="&<value>"`.
Here are a few examples:

- `NCCL_TESTS_SPLIT="AND 0x7"` or `NCCL_TESTS_SPLIT="MOD 8"`: On systems with 8 GPUs, run 8 parallel operations, each with 1 GPU per node (purely communicating over the inter-node network).
- `NCCL_TESTS_SPLIT="OR 0x7"` or `NCCL_TESTS_SPLIT="DIV 8"`: On systems with 8 GPUs, run one operation per node, purely intra-node.
- `NCCL_TESTS_SPLIT="AND 0x1"` or `NCCL_TESTS_SPLIT="MOD 2"`: Run two operations, each operation using every other rank.
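A sketch of how the four operators map a rank to a color (the operator semantics here are inferred from their names, not taken from the tests' actual source):

```python
def split_color(rank, spec):
    """Compute a group "color" from an NCCL_TESTS_SPLIT-style spec,
    e.g. "MOD 8", "DIV 8", "AND 0x7", or "OR 0x7"."""
    op, value = spec.split()
    v = int(value, 0)  # parses decimal or 0x-prefixed hex
    ops = {
        "MOD": lambda r: r % v,
        "DIV": lambda r: r // v,
        "AND": lambda r: r & v,
        "OR":  lambda r: r | v,
    }
    return ops[op](rank)

# With 8 ranks per node, "AND 0x7" and "MOD 8" assign the same colors,
# so both split the job into 8 groups of 1 GPU per node:
print([split_color(r, "AND 0x7") for r in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7]
print([split_color(r, "MOD 8") for r in range(8)])    # [0, 1, 2, 3, 4, 5, 6, 7]
```

Ranks sharing a color land in the same group, matching the grouping printed next to each GPU at the start of a test.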
Note that the reported bandwidth is per group, hence to get the total bandwidth used by all groups, one must multiply by the number of groups.
## Copyright

NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2025, NVIDIA CORPORATION. All rights reserved.