Cpp-Taskflow  2.4-master-branch
C6: CPU-GPU Tasking

Modern scientific computing typically leverages GPU-powered parallel processing cores to speed up large-scale applications. This chapter discusses how to implement heterogeneous decomposition algorithms using CPU-GPU collaborative tasking.

Create a cudaFlow

Cpp-Taskflow enables concurrent CPU-GPU tasking by leveraging CUDA Graph. The tasking interface is referred to as cudaFlow. A cudaFlow is a graph object of type tf::cudaFlow created at runtime, similar to dynamic tasking. It manages a task node in a taskflow and associates it with a CUDA Graph. To create a cudaFlow, emplace a callable that takes a reference to tf::cudaFlow. The following example implements the canonical saxpy (single-precision A·X Plus Y) task graph.

 1: #include <taskflow/taskflow.hpp>
 2:
 3: // saxpy (single-precision A·X Plus Y) kernel
 4: __global__ void saxpy(int n, float a, float *x, float *y) {
 5:   int i = blockIdx.x*blockDim.x + threadIdx.x;
 6:   if (i < n) {
 7:     y[i] = a*x[i] + y[i];
 8:   }
 9: }
10:
11: // main function begins
12: int main() {
13:
14:   tf::Taskflow taskflow;
15:   tf::Executor executor;
16:
17:   const unsigned N = 1<<20;        // size of the vector
18:
19:   std::vector<float> hx(N, 1.0f);  // x vector at host
20:   std::vector<float> hy(N, 2.0f);  // y vector at host
21:
22:   float *dx{nullptr};              // x vector at device
23:   float *dy{nullptr};              // y vector at device
24:
25:   tf::Task allocate_x = taskflow.emplace(
26:     [&](){ cudaMalloc(&dx, N*sizeof(float)); }
27:   );
28:
29:   tf::Task allocate_y = taskflow.emplace(
30:     [&](){ cudaMalloc(&dy, N*sizeof(float)); }
31:   );
32:
33:   tf::Task cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
34:     // create data transfer tasks
35:     tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N);  // host-to-device x data transfer
36:     tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N);  // host-to-device y data transfer
37:     tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N);  // device-to-host x data transfer
38:     tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N);  // device-to-host y data transfer
39:
40:     // launch saxpy<<<(N+255)/256, 256, 0>>>(N, 2.0f, dx, dy)
41:     tf::cudaTask kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
42:
43:     kernel.succeed(h2d_x, h2d_y)
44:           .precede(d2h_x, d2h_y);
45:   });
46:   cudaflow.succeed(allocate_x, allocate_y);  // the cudaFlow runs after the two (overlapping) allocations
47:
48:   executor.run(taskflow).wait();
49:
50:   taskflow.dump(std::cout);  // dump the taskflow graph
51: }
[Figure: the saxpy task graph (saxpy.svg)]

Debrief:

  • Lines 3-9 define a saxpy kernel using CUDA
  • Lines 19-20 declare two host vectors, hx and hy
  • Lines 22-23 declare two device vector pointers, dx and dy
  • Lines 25-31 declare two tasks to allocate memory for dx and dy on device, each of N*sizeof(float) bytes
  • Lines 33-45 create a cudaFlow to capture kernel work in a graph (two host-to-device data transfer tasks, one saxpy kernel task, and two device-to-host data transfer tasks)
  • Lines 46-48 define the task dependency between host tasks and the cudaFlow tasks and execute the taskflow
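Note that the example above never releases dx and dy. A minimal sketch of how one might extend the taskflow with deallocation tasks that run only after the cudaFlow completes (the task names free_x and free_y are ours, not part of the original example):

tf::Task free_x = taskflow.emplace([&](){ cudaFree(dx); });
tf::Task free_y = taskflow.emplace([&](){ cudaFree(dy); });

cudaflow.precede(free_x, free_y);  // deallocate only after all GPU work has finished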

Cpp-Taskflow does not expend unnecessary effort on kernel programming but focuses on tasking CUDA operations alongside CPU work. We give users full freedom to craft a CUDA kernel that is commensurate with their domain knowledge. Users focus on developing high-performance kernels using the native CUDA toolkit, while leaving the difficult task parallelism to Cpp-Taskflow.

Compile a cudaFlow Program

Use nvcc (version 10 or above) to compile a cudaFlow program:

~$ nvcc my_cudaflow.cu -I path/to/include/taskflow -O2 -o my_cudaflow
~$ ./my_cudaflow

Our source automatically enables cudaFlow when it detects a CUDA compiler.
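Cpp-Taskflow requires a C++14 compiler, so if your host compiler does not default to C++14 or later you may need to pass the standard explicitly:

~$ nvcc -std=c++14 my_cudaflow.cu -I path/to/include/taskflow -O2 -o my_cudaflow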

Configure the Number of GPU Workers

By default, the executor spawns one worker per GPU. We dedicate a set of workers to each heterogeneous domain, for example, the host (CPU) domain and the CUDA domain. If your system has 4 CPU cores and 2 GPUs, the executor by default spawns 4+2 workers, where 4 workers run CPU tasks and 2 workers run GPU tasks (cudaFlows). You can construct an executor with a different number of GPU workers:

tf::Executor executor(17, 8); // 17 CPU workers and 8 GPU workers

The above executor spawns 17 workers to run CPU tasks and 8 workers to run GPU tasks. These workers coordinate with one another to balance load through a work-stealing loop that is highly optimized for performance.
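Rather than hard-coding these numbers, you can size the executor from the actual hardware. A minimal sketch, using the standard CUDA query cudaGetDeviceCount (error handling omitted; <thread> is assumed to be included):

unsigned num_cpus = std::thread::hardware_concurrency();  // # of hardware threads
int num_gpus = 0;
cudaGetDeviceCount(&num_gpus);                            // # of visible CUDA devices
tf::Executor executor(num_cpus, num_gpus);                // one worker per CPU core and per GPU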

Run a cudaFlow on Multiple GPUs

You can run a cudaFlow on multiple GPUs by explicitly associating a cudaFlow or a kernel task with a CUDA device. A CUDA device is an integer in the range [0, N) that identifies a GPU, where N is the number of GPUs in the system. The code below creates a cudaFlow that runs on GPU 2 through my_stream.

taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.device(2);          // place this cudaFlow on GPU 2
  cf.stream(my_stream);  // by default, every cudaFlow runs on the default stream 0
  // add more cudaTasks below (all tasks run on GPU 2 unless placed explicitly)
});
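The stream my_stream above is assumed to be one you have created beforehand with the standard CUDA runtime API, for example:

cudaStream_t my_stream;
cudaStreamCreate(&my_stream);  // create a non-default stream to run the cudaFlow on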

You can also place an individual kernel on a device explicitly through the method tf::cudaFlow::kernel_on, which takes the device identifier as its first argument.

 1: #include <taskflow/taskflow.hpp>
 2:
 3: // saxpy (A·X Plus Y) kernel on integer data; stores the result in z
 4: __global__ void saxpy(int n, int a, int *x, int *y, int *z) {
 5:   int i = blockIdx.x*blockDim.x + threadIdx.x;
 6:   if (i < n) {
 7:     z[i] = a*x[i] + y[i];
 8:   }
 9: }
10:
11: int main() {
12:
13:   const unsigned N = 1<<20;
14:
15:   int* dx {nullptr};
16:   int* dy {nullptr};
17:   int* z1 {nullptr};
18:   int* z2 {nullptr};
19:
20:   cudaMallocManaged(&dx, N*sizeof(int));  // create a unified memory block for x
21:   cudaMallocManaged(&dy, N*sizeof(int));  // create a unified memory block for y
22:   cudaMallocManaged(&z1, N*sizeof(int));  // result of saxpy task 1
23:   cudaMallocManaged(&z2, N*sizeof(int));  // result of saxpy task 2
24:
25:   for(unsigned i=0; i<N; ++i) {
26:     dx[i] = 1;
27:     dy[i] = 2;
28:   }
29:
30:   tf::Taskflow taskflow;
31:   tf::Executor executor;
32:
33:   taskflow.emplace([&](tf::cudaFlow& cf){
34:     // launch the cudaFlow on GPU 0
35:     cf.device(0);
36:
37:     // launch the first saxpy kernel on GPU 1
38:     cf.kernel_on(1, (N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z1);
39:
40:     // launch the second saxpy kernel on GPU 3
41:     cf.kernel_on(3, (N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z2);
42:   });
43:
44:   executor.run(taskflow).wait();
45:
46:   cudaFree(dx);
47:   cudaFree(dy);
48:
49:   // verify the solution; max_error should be zero
50:   int max_error = 0;
51:   for (size_t i = 0; i < N; i++) {
52:     max_error = std::max(max_error, abs(z1[i]-4));
53:     max_error = std::max(max_error, abs(z2[i]-4));
54:   }
55:   std::cout << "saxpy finished with max error: " << max_error << '\n';
56: }

Debrief:

  • Lines 3-9 define a CUDA saxpy kernel that stores the result in z
  • Lines 15-23 declare four unified memory blocks accessible from any processor
  • Lines 25-28 initialize the dx and dy blocks from the CPU
  • Lines 33-42 create a cudaFlow task
  • Lines 34-35 associate the cudaFlow with GPU 0
  • Lines 37-38 create a kernel task to launch the first saxpy on GPU 1 and store the result in z1
  • Lines 40-41 create a kernel task to launch the second saxpy on GPU 3 and store the result in z2
  • Lines 44-55 run the taskflow and verify the result (max_error should be zero; see the device-count guard sketched below for machines with fewer GPUs)
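Because the example hard-codes device identifiers 1 and 3, it only runs correctly on a machine with at least four GPUs. A hedged sketch of guarding the identifiers before building the cudaFlow, using the standard query cudaGetDeviceCount (the error message is ours):

int num_gpus = 0;
cudaGetDeviceCount(&num_gpus);  // number of visible CUDA devices
if(num_gpus < 4) {
  std::cerr << "this example requires at least 4 GPUs\n";
  return -1;
}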

Running the program gives the following nvidia-smi snapshot on a system with 4 GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:18:00.0 Off |                  N/A |
| 32%   35C    P2    68W / 250W |    163MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 33%   43C    P2   247W / 250W |    293MiB / 11019MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:86:00.0 Off |                  N/A |
| 32%   37C    P0    72W / 250W |     10MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 31%   43C    P2   245W / 250W |    293MiB / 11019MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     53869      C   ./a.out                                     153MiB  |
|    1     53869      C   ./a.out                                     155MiB  |
|    3     53869      C   ./a.out                                     155MiB  |
+-----------------------------------------------------------------------------+

Although cudaFlow provides an interface for device placement, it is the user's responsibility to ensure correct memory access. For example, you may not allocate a memory block on GPU 2 using cudaMalloc and then access it from a kernel running on GPU 1. A safe practice is to allocate unified memory blocks using cudaMallocManaged and let the CUDA runtime perform automatic memory migration between processors (as demonstrated in the code example above).
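If a kernel on one GPU must read device-local memory allocated on another GPU, plain CUDA also offers peer-to-peer access as an alternative to unified memory. A minimal sketch, assuming the devices support peer access (this is the standard CUDA runtime API, not a cudaFlow feature):

int can_access = 0;
cudaDeviceCanAccessPeer(&can_access, 1, 2);  // can GPU 1 access memory on GPU 2?
if(can_access) {
  cudaSetDevice(1);                          // subsequent runtime calls target GPU 1
  cudaDeviceEnablePeerAccess(2, 0);          // let GPU 1 dereference GPU 2's memory
}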