CUB
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Groups
Static Public Methods | List of all members
cub::DeviceRunLengthEncode Struct Reference

Detailed description

DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory.

run_length_encode_logo.png
.
Overview
A run-length encoding computes a simple compressed representation of a sequence of input elements such that each maximal "run" of consecutive same-valued data items is encoded as a single data value along with a count of the elements in that run.
Usage Considerations
  • Dynamic parallelism. DeviceRunLengthEncode methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance
The work-complexity of run-length encode as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
The following chart illustrates DeviceRunLengthEncode::RunLengthEncode performance across different CUDA architectures for int32 items. Segments have lengths uniformly sampled from [1,1000].
rle_int32_len_500.png
Performance plots for other scenarios can be found in the detailed method descriptions below.

Definition at line 78 of file device_run_length_encode.cuh.

Static Public Methods

template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
Encode (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, UniqueOutputIteratorT d_unique_out, LengthsOutputIteratorT d_counts_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Computes a run-length encoding of the sequence d_in. More...
 
template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static
__forceinline__ cudaError_t 
NonTrivialRuns (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OffsetsOutputIteratorT d_offsets_out, LengthsOutputIteratorT d_lengths_out, NumRunsOutputIteratorT d_num_runs_out, int num_items, cudaStream_t stream=0, bool debug_synchronous=false)
 Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in. More...
 

Member Function Documentation

template<typename InputIteratorT , typename UniqueOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::Encode ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
UniqueOutputIteratorT  d_unique_out,
LengthsOutputIteratorT  d_counts_out,
NumRunsOutputIteratorT  d_num_runs_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Computes a run-length encoding of the sequence d_in.

  • For the ith run encountered, the first key of the run and its length are written to d_unique_out[i] and d_counts_out[i], respectively.
  • The total number of runs encountered is written to d_num_runs_out.
  • The == equality operator is used to determine whether values are equivalent
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
The following charts illustrate saturated encode performance across different CUDA architectures for int32 and int64 items, respectively. Segments have lengths uniformly sampled from [1,1000].
rle_int32_len_500.png
rle_int64_len_500.png
The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
rle_int32_len_5.png
rle_int64_len_5.png
Snippet
The code snippet below illustrates the run-length encoding of a sequence of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 8
int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_unique_out; // e.g., [ , , , , , , , ]
int *d_counts_out; // e.g., [ , , , , , , , ]
int *d_num_runs_out; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::Encode(d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run encoding
cub::DeviceRunLengthEncode::Encode(d_temp_storage, temp_storage_bytes, d_in, d_unique_out, d_counts_out, d_num_runs_out, num_items);
// d_unique_out <-- [0, 2, 9, 5, 8]
// d_counts_out <-- [1, 2, 1, 3, 1]
// d_num_runs_out <-- [5]
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
UniqueOutputIteratorT[inferred] Random-access output iterator type for writing unique output items (may be a simple pointer type)
LengthsOutputIteratorT[inferred] Random-access output iterator type for writing output counts (may be a simple pointer type)
NumRunsOutputIteratorT[inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to the input sequence of keys
[out]d_unique_outPointer to the output sequence of unique keys (one key per run)
[out]d_counts_outPointer to the output sequence of run-lengths (one count per run)
[out]d_num_runs_outPointer to total number of runs
[in]num_itemsTotal number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

Definition at line 148 of file device_run_length_encode.cuh.

template<typename InputIteratorT , typename OffsetsOutputIteratorT , typename LengthsOutputIteratorT , typename NumRunsOutputIteratorT >
CUB_RUNTIME_FUNCTION static __forceinline__ cudaError_t cub::DeviceRunLengthEncode::NonTrivialRuns ( void *  d_temp_storage,
size_t &  temp_storage_bytes,
InputIteratorT  d_in,
OffsetsOutputIteratorT  d_offsets_out,
LengthsOutputIteratorT  d_lengths_out,
NumRunsOutputIteratorT  d_num_runs_out,
int  num_items,
cudaStream_t  stream = 0,
bool  debug_synchronous = false 
)
inlinestatic

Enumerates the starting offsets and lengths of all non-trivial runs (of length > 1) of same-valued keys in the sequence d_in.

  • For the ith non-trivial run, the run's starting offset and its length are written to d_offsets_out[i] and d_lengths_out[i], respectively.
  • The total number of runs encountered is written to d_num_runs_out.
  • The == equality operator is used to determine whether values are equivalent
  • When d_temp_storage is NULL, no work is done and the required allocation size is returned in temp_storage_bytes.
Performance
Snippet
The code snippet below illustrates the identification of non-trivial runs within a sequence of int values.
#include <cub/cub.cuh> // or equivalently <cub/device/device_run_length_encode.cuh>
// Declare, allocate, and initialize device-accessible pointers for input and output
int num_items; // e.g., 8
int *d_in; // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
int *d_offsets_out; // e.g., [ , , , , , , , ]
int *d_lengths_out; // e.g., [ , , , , , , , ]
int *d_num_runs_out; // e.g., [ ]
...
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceRunLengthEncode::NonTrivialRuns(d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run encoding
cub::DeviceRunLengthEncode::NonTrivialRuns(d_temp_storage, temp_storage_bytes, d_in, d_offsets_out, d_lengths_out, d_num_runs_out, num_items);
// d_offsets_out <-- [1, 4]
// d_lengths_out <-- [2, 3]
// d_num_runs_out <-- [2]
Template Parameters
InputIteratorT[inferred] Random-access input iterator type for reading input items (may be a simple pointer type)
OffsetsOutputIteratorT[inferred] Random-access output iterator type for writing run-offset values (may be a simple pointer type)
LengthsOutputIteratorT[inferred] Random-access output iterator type for writing run-length values (may be a simple pointer type)
NumRunsOutputIteratorT[inferred] Output iterator type for recording the number of runs encountered (may be a simple pointer type)
Parameters
[in]d_temp_storageDevice-accessible allocation of temporary storage. When NULL, the required allocation size is written to temp_storage_bytes and no work is done.
[in,out]temp_storage_bytesReference to size in bytes of d_temp_storage allocation
[in]d_inPointer to input sequence of data items
[out]d_offsets_outPointer to output sequence of run-offsets (one offset per non-trivial run)
[out]d_lengths_outPointer to output sequence of run-lengths (one count per non-trivial run)
[out]d_num_runs_outPointer to total number of runs (i.e., length of d_offsets_out)
[in]num_itemsTotal number of associated key+value pairs (i.e., the length of d_in_keys and d_in_values)
[in]stream[optional] CUDA stream to launch kernels within. Default is stream0.
[in]debug_synchronous[optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.

Definition at line 245 of file device_run_length_encode.cuh.


The documentation for this struct was generated from the following file: