Modern scientific computing typically leverages GPU-powered parallel processing cores to speed up large-scale applications. This chapter discusses how to implement heterogeneous decomposition algorithms using CPU-GPU collaborative tasking.
Create a cudaFlow
Cpp-Taskflow enables concurrent CPU-GPU tasking by leveraging CUDA Graph. The tasking interface is referred to as cudaFlow. A cudaFlow is a graph object of type tf::cudaFlow created at runtime, similar to dynamic tasking. It manages a task node in a taskflow and associates it with a CUDA graph. To create a cudaFlow, emplace a callable that takes a reference to a tf::cudaFlow object. The following example implements the canonical saxpy (single-precision A·X plus Y) task graph.
1: #include <taskflow/taskflow.hpp>
2:
3: // saxpy (single-precision A·X plus Y) kernel
4: __global__ void saxpy(int n, float a, float *x, float *y) {
5:   int i = blockIdx.x*blockDim.x + threadIdx.x;
6:   if (i < n) y[i] = a*x[i] + y[i];
7: }
Lines 22-23 declare two device vector pointers, dx and dy.
Lines 25-31 declare two tasks that allocate GPU memory for dx and dy, each of N*sizeof(float) bytes.
Lines 33-45 create a cudaFlow to capture the GPU work in a graph: two host-to-device data transfer tasks, one saxpy kernel task, and two device-to-host data transfer tasks (a sketch of this cudaFlow body is given below).
Lines 46-48 define the dependencies between the host tasks and the cudaFlow task and execute the taskflow.
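The full listing is not reproduced here. The sketch below shows what the cudaFlow body described in lines 33-45 may look like; the host vectors hx and hy, the device pointers dx and dy, the problem size N, the scaling factor 2.0f, and the grid sizing are assumptions carried over from the typical saxpy setup rather than the exact original code.

taskflow.emplace([&](tf::cudaFlow& cf) {
  // host-to-device data transfer tasks
  tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N).name("h2d_x");
  tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N).name("h2d_y");
  // device-to-host data transfer tasks
  tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N).name("d2h_x");
  tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N).name("d2h_y");
  // saxpy kernel task launched with (N+255)/256 blocks of 256 threads
  tf::cudaTask kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy)
                          .name("saxpy");
  // both copies to the device run before the kernel;
  // the kernel runs before both copies back to the host
  h2d_x.precede(kernel);
  h2d_y.precede(kernel);
  kernel.precede(d2h_x, d2h_y);
}).name("saxpy_cudaflow");

Here, cf.copy performs a typed copy of N elements, and cf.kernel takes the execution configuration (grid size, block size, shared memory) followed by the kernel and its arguments.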
Cpp-Taskflow does not expend unnecessary effort on kernel programming but focuses on tasking CUDA operations together with CPU work. Users keep full control over crafting CUDA kernels that draw on their domain knowledge and can develop high-performance kernels with the native CUDA toolkit, while leaving the difficult task parallelism to Cpp-Taskflow.
Compile a cudaFlow Program
Use nvcc (at least v10) to compile a cudaFlow program:
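The exact command depends on your setup; as an illustration, with the Taskflow headers at a placeholder path, a typical invocation looks like this:

nvcc -std=c++14 saxpy.cu -I path/to/taskflow -O2 -o saxpy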
The Taskflow source automatically enables cudaFlow when it detects a CUDA compiler.
Configure the Number of GPU Workers
By default, the executor spawns one worker per GPU. We dedicate a set of workers to each heterogeneous domain, for example, the host domain and the CUDA domain. If your system has 4 CPU cores and 2 GPUs, the executor spawns 4+2 workers by default: 4 workers run CPU tasks and 2 workers run GPU tasks (cudaFlows). You can construct an executor with a different number of GPU workers.
tf::Executor executor(17, 8); // 17 CPU workers and 8 GPU workers
The above executor spawns 17 and 8 workers for running CPU and GPU tasks, respectively. These workers coordinate with each other to balance the load in a work-stealing loop highly optimized for performance.
Run a cudaFlow on Multiple GPUs
You can run a cudaFlow on multiple GPUs by explicitly associating a cudaFlow or a kernel task with a CUDA device. A CUDA device is an integer in the range [0, N) that identifies a GPU, where N is the number of GPUs in the system. The code below creates a cudaFlow that runs on GPU device 2 through my_stream.
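The original listing is not reproduced here. As an illustration only, the sketch below assumes a Taskflow version that provides tf::Taskflow::emplace_on for binding a cudaFlow to a device (it does not reproduce the my_stream handling of the original example), and it allocates unified memory so the data is valid on any GPU:

const int N = 1000000;
float *x, *y;

// unified memory is accessible from the host and from any GPU
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));

tf::Taskflow taskflow;
tf::Executor executor;

// emplace_on (assumed available in this version) places the cudaFlow on GPU device 2
taskflow.emplace_on([&](tf::cudaFlow& cf) {
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, x, y).name("saxpy_on_device_2");
}, 2);

executor.run(taskflow).wait();

cudaFree(x);
cudaFree(y);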
Even though cudaFlow provides an interface for device placement, it is the user's responsibility to ensure correct memory access. For example, you may not allocate a memory block on GPU 2 using cudaMalloc and access it from a kernel on GPU 1. A safe practice is to allocate unified memory blocks using cudaMallocManaged and let the CUDA runtime perform automatic memory migration between processors (as demonstrated in the code example above).