CUB

Classes

class  cub::WarpScan< T, LOGICAL_WARP_THREADS, PTX_ARCH >
 The WarpScan class provides collective methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.

 
class  cub::WarpReduce< T, LOGICAL_WARP_THREADS, PTX_ARCH >
 The WarpReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread warp.

 

Functions

template<typename T >
__device__ __forceinline__ T cub::ShuffleUp (T input, int src_offset, int first_lane=0)
 Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i - src_offset). For thread lanes i < src_offset, the thread's own input is returned to the thread.

 
template<typename T >
__device__ __forceinline__ T cub::ShuffleDown (T input, int src_offset, int last_lane=CUB_PTX_WARP_THREADS-1)
 Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i + src_offset). For thread lanes where i + src_offset > last_lane, the thread's own input is returned to the thread.

 
template<typename T >
__device__ __forceinline__ T cub::ShuffleIndex (T input, int src_lane)
 Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread.

 
__device__ __forceinline__ int cub::WarpAll (int cond)
 Portable implementation of __all.
 
__device__ __forceinline__ int cub::WarpAny (int cond)
 Portable implementation of __any.
 

Function Documentation

template<typename T >
__device__ __forceinline__ T cub::ShuffleUp (T input, int src_offset, int first_lane = 0)

Shuffle-up for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i - src_offset). For thread lanes i < src_offset, the thread's own input is returned to the thread.


  • Available only for SM3.0 or newer
Snippet
The code snippet below illustrates each thread obtaining a double value from the predecessor of its predecessor.
#include <cub/cub.cuh> // or equivalently <cub/util_ptx.cuh>

__global__ void ExampleKernel(...)
{
    // Obtain one input item per thread
    double thread_data = ...

    // Obtain item from two ranks below
    double peer_data = cub::ShuffleUp(thread_data, 2);
}
Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 2.0, 1.0, 2.0, 3.0, ..., 30.0}.
Parameters
[in] input       The value to broadcast
[in] src_offset  The relative down-offset of the peer to read from
[in] first_lane  Index of first lane in segment

Definition at line 498 of file util_ptx.cuh.
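
The shuffle-up semantics described above can be checked on the host without a GPU. The sketch below is a plain C++ model, not the CUB implementation: ModelShuffleUp is a hypothetical helper that takes the values contributed by every lane as a vector and returns what each lane would receive. It assumes that a lane keeps its own input whenever its source lane would fall below first_lane, which matches the documented behavior for the default first_lane = 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the documented shuffle-up semantics.
// inputs[i] is the value contributed by warp-lane i; the returned vector
// holds the value each lane would obtain. This is not CUB code.
template <typename T>
std::vector<T> ModelShuffleUp(const std::vector<T>& inputs,
                              int src_offset,
                              int first_lane = 0)
{
    assert(src_offset >= 0 && first_lane >= 0);

    std::vector<T> outputs(inputs.size());
    for (std::size_t lane = 0; lane < inputs.size(); ++lane)
    {
        const int src = static_cast<int>(lane) - src_offset;

        // Assumption: a lane whose source falls below first_lane
        // receives its own input back.
        outputs[lane] = (src >= first_lane)
                            ? inputs[static_cast<std::size_t>(src)]
                            : inputs[lane];
    }
    return outputs;
}
```

Calling ModelShuffleUp on the 32 inputs {1.0, 2.0, ..., 32.0} with src_offset = 2 yields {1.0, 2.0, 1.0, 2.0, 3.0, ..., 30.0}, the same sequence as the example above.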

template<typename T >
__device__ __forceinline__ T cub::ShuffleDown (T input, int src_offset, int last_lane = CUB_PTX_WARP_THREADS - 1)

Shuffle-down for any data type. Each warp-lane i obtains the value input contributed by warp-lane (i + src_offset). For thread lanes where i + src_offset > last_lane, the thread's own input is returned to the thread.


  • Available only for SM3.0 or newer
Snippet
The code snippet below illustrates each thread obtaining a double value from the successor of its successor.
#include <cub/cub.cuh> // or equivalently <cub/util_ptx.cuh>

__global__ void ExampleKernel(...)
{
    // Obtain one input item per thread
    double thread_data = ...

    // Obtain item from two ranks above
    double peer_data = cub::ShuffleDown(thread_data, 2);
}
Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {3.0, 4.0, 5.0, 6.0, 7.0, ..., 32.0}.
Parameters
[in] input       The value to broadcast
[in] src_offset  The relative up-offset of the peer to read from
[in] last_lane   Index of last lane in segment

Definition at line 559 of file util_ptx.cuh.
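
The shuffle-down semantics can be modeled on the host in the same way. The sketch below is plain C++ rather than the CUB implementation: ModelShuffleDown is a hypothetical helper, and it assumes that a lane keeps its own input whenever its source lane i + src_offset would lie beyond last_lane.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the documented shuffle-down semantics.
// inputs[i] is the value contributed by warp-lane i; the returned vector
// holds the value each lane would obtain. This is not CUB code.
template <typename T>
std::vector<T> ModelShuffleDown(const std::vector<T>& inputs,
                                int src_offset,
                                int last_lane)
{
    assert(src_offset >= 0);
    assert(last_lane >= 0 && last_lane < static_cast<int>(inputs.size()));

    std::vector<T> outputs(inputs.size());
    for (std::size_t lane = 0; lane < inputs.size(); ++lane)
    {
        const int src = static_cast<int>(lane) + src_offset;

        // Assumption: a lane whose source lies beyond last_lane
        // receives its own input back.
        outputs[lane] = (src <= last_lane)
                            ? inputs[static_cast<std::size_t>(src)]
                            : inputs[lane];
    }
    return outputs;
}
```

For the 32 inputs {1.0, 2.0, ..., 32.0} with src_offset = 2 and last_lane = 31, the model returns {3.0, 4.0, 5.0, ..., 32.0, 31.0, 32.0}: lanes 30 and 31 have no peer two ranks above, so they keep their own values.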

template<typename T >
__device__ __forceinline__ T cub::ShuffleIndex (T input, int src_lane)

Shuffle-broadcast for any data type. Each warp-lane i obtains the value input contributed by warp-lane src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread.


  • Available only for SM3.0 or newer
Snippet
The code snippet below illustrates each thread obtaining a double value from warp-lane 0.
#include <cub/cub.cuh> // or equivalently <cub/util_ptx.cuh>

__global__ void ExampleKernel(...)
{
    // Obtain one input item per thread
    double thread_data = ...

    // Obtain item from thread 0
    double peer_data = cub::ShuffleIndex(thread_data, 0);
}
Suppose the set of input thread_data across the first warp of threads is {1.0, 2.0, 3.0, 4.0, 5.0, ..., 32.0}. The corresponding output peer_data will be {1.0, 1.0, 1.0, 1.0, 1.0, ..., 1.0}.
Parameters
[in] input     The value to broadcast
[in] src_lane  The warp lane whose input is broadcast

Definition at line 663 of file util_ptx.cuh.
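
The broadcast semantics, including the out-of-range case, can also be sketched as a host-side model. ModelShuffleIndex below is a hypothetical helper in plain C++, not CUB code; it treats the size of the input vector as WARP_THREADS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the documented shuffle-broadcast semantics.
// inputs[i] is the value contributed by warp-lane i; inputs.size() stands in
// for WARP_THREADS. This is not CUB code.
template <typename T>
std::vector<T> ModelShuffleIndex(const std::vector<T>& inputs, int src_lane)
{
    assert(!inputs.empty());

    const int warp_threads = static_cast<int>(inputs.size());
    const bool in_range = (src_lane >= 0) && (src_lane < warp_threads);

    // Default case: every lane receives its own input back.
    std::vector<T> outputs(inputs);

    // In-range source lane: every lane receives that lane's input.
    if (in_range)
    {
        outputs.assign(inputs.size(),
                       inputs[static_cast<std::size_t>(src_lane)]);
    }
    return outputs;
}
```

With the 32 inputs {1.0, 2.0, ..., 32.0} and src_lane = 0, every element of the result is 1.0. With src_lane = -1 or src_lane = 32, the result equals the input.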