As my first foray into CUDA programming I set myself a challenge to experiment with different methods of finding the total sum of a large array of numbers, trying different strategies exposed via CUDA to see how different approaches compare and what comes out fastest. A lot of my GPU experience previously on PC has been through DirectX 12 or Vulkan so this felt like a good opportunity to see how programming CUDA compares for a non-graphics or games related workload.
Source Code: https://github.com/comeradealexi/CUDAParallelAddBenchmarks/blob/main/kernel.cu
Test Setup
Each test reads from a large array of integers, adding every entry together and returning the sum. Given the parallel nature of GPUs, we use the atomicAdd function to allow separate threads and blocks to increment the final output value.
For the various test scenarios which are detailed below, we will run with a suite of different block sizes as well as the number of values each thread processes from the input array in an attempt to find the most optimal setup. Block sizes will differ from 32 up to 1024 which has an effect on occupancy as well as the tests involving shared memory.
My expectation is that we’ll get faster results as we reduce the number of times we call atomicAdd, however this isn’t quite what my results find.
With the specific test scenarios below, we will run with the following setups:
- 512MiB Buffer & 1 Kernel Run
- 4 MiB Buffer & 64/128 Kernel Runs
I wanted to test the difference in the results between a very large buffer as well as a small kernel accessed multiple times to ensure I observe the difference in the memory speed overhead. I’m expecting that the 4MiB buffer tests, ran multiple times will remain in the GPU caches and better tests how the code is performing, while I expect the very large buffers to have memory access overheads, potentially resulting in differing results.
Note: In order to ensure stable GPU frequencies, these tests have been run with the following enabled: nvidia-smi -lgc 2800
Test Scenarios
Scenario 1: Atomic Add Per Value Per Thread
Here to demonstrate a suboptimal setup, here we call atomicAdd on every value we loop over per thread.
template<unsigned int VALUES_PER_THREAD>
__global__ void Add_AtomicPerValuePerThread(int* output, const int* input)
{
const unsigned int workIndex = (threadIdx.x + (blockDim.x * blockIdx.x)) * VALUES_PER_THREAD;
for (unsigned int i = 0; i < VALUES_PER_THREAD; i++)
{
int value = input[workIndex + i];
atomicAdd(&output[0], value);
}
}
Scenario 2: Atomic Add Per Thread
Here we calculate our thread total and call atomicAdd only once per thread.
template<unsigned int VALUES_PER_THREAD>
__global__ void Add_AtomicPerThread(int* output, const int* input)
{
const unsigned int workIndex = (threadIdx.x + (blockDim.x * blockIdx.x)) * VALUES_PER_THREAD;
int total = 0;
for (unsigned int i = 0; i < VALUES_PER_THREAD; i++)
{
total += input[workIndex + i];
}
atomicAdd(&output[0], total);
}
Scenario 3: Reduce Add Sync – Atomic Add Per Warp (32 Threads)
Here we introduce the __reduce_add_sync function which sums the ‘total’ value from all threads in a warp (32 threads) and returns the same value to all threads. Then we only call atomicAdd once per warp. This means we call atomicAdd 32x less overall.
In order to use the wave intrinsic function __reduce_add_sync I had to ensure Code Generation was set to: compute_80,sm_80
template<unsigned int VALUES_PER_THREAD>
__global__ void Add_ReduceAddSyncPerWarp(int* output, const int* input)
{
const unsigned int workIndex = (threadIdx.x + (blockDim.x * blockIdx.x)) * VALUES_PER_THREAD;
int total = 0;
for (unsigned int i = 0; i < VALUES_PER_THREAD; i++)
{
total += input[workIndex + i];
}
// Sum of all 32 threads in a given warp
int blockTotal = __reduce_add_sync(0xFFFFFFFF, total);
// First of every 32 threads atomically increments final output
if ((threadIdx.x & 31) == 0)
{
atomicAdd(&output[0], blockTotal);
}
}
Scenario 4: Atomic Add To Shared Memory – Atomic Add Per Block
Here every thread in the block atomically increments shared memory, then 1 thread from the block does a final atomic increment. This means we only have 1 atomicAdd per block.
template<unsigned int VALUES_PER_THREAD>
__global__ void Add_AtomicSharedPerThread(int* output, const int* input)
{
__shared__ int sharedData;
if (threadIdx.x == 0)
{
sharedData = 0;
}
__syncthreads();
const unsigned int workIndex = (threadIdx.x + (blockDim.x * blockIdx.x)) * VALUES_PER_THREAD;
int total = 0;
for (unsigned int i = 0; i < VALUES_PER_THREAD; i++)
{
total += input[workIndex + i];
}
atomicAdd(&sharedData, total);
__syncthreads();
if (threadIdx.x == 0)
{
atomicAdd(&output[0], sharedData);
}
}
Scenario 5: Add To Shared Memory Array – Atomic Add Per Block
Here instead of atomically incrementing a single shared data between threads, each increments its own shared value. Then the final thread loops over this and sums them. We expect this to potentially be inefficient because we rely on a single thread from the entire block to do a lot of processing while other threads are idle.
template<unsigned int VALUES_PER_THREAD>
__global__ void Add_SharedUniqueEntryPerThread(int* output, const int* input) // Global atomic increments = Total Threads / Block Size
{
constexpr int SHARED_DATA_LENGTH = 1024;
__shared__ int sharedData[SHARED_DATA_LENGTH];
const unsigned int workIndex = (threadIdx.x + (blockDim.x * blockIdx.x)) * VALUES_PER_THREAD;
int total = 0;
for (unsigned int i = 0; i < VALUES_PER_THREAD; i++)
{
total += input[workIndex + i];
}
sharedData[threadIdx.x] = total;
__syncthreads();
if (threadIdx.x == 0) // First thread in the block loops over all shared data.
{
int blockTotal = 0;
for (int i = 0; i < blockDim.x; i++)
{
blockTotal += sharedData[i];
}
atomicAdd(&output[0], blockTotal);
}
}
Scenario 6: Shared Memory Array + Reduce Add Sync + Atomic Add Per Block
Here we combine the features of all the other tests by using shared memory and __reduce_add_sync
template<unsigned int VALUES_PER_THREAD, unsigned int BLOCK_SIZE>
__global__ void Add_SharedWithReduceAddSyncPerWarp(int* output, const int* input) // Global atomic increments = Total Threads / Block Size
{
constexpr int SHARED_DATA_LENGTH = (BLOCK_SIZE / 32) > 0 ? (BLOCK_SIZE / 32) : 1;
__shared__ int sharedData[SHARED_DATA_LENGTH];
const unsigned int workIndex = (threadIdx.x + (blockDim.x * blockIdx.x)) * VALUES_PER_THREAD;
int total = 0;
for (unsigned int i = 0; i < VALUES_PER_THREAD; i++)
{
total += input[workIndex + i];
}
int warpTotal = __reduce_add_sync(0xFFFFFFFF, total);
if ((threadIdx.x & 31) == 0) // Each warp sums all threads in that warp.
{
// Divide by warp count
sharedData[threadIdx.x / 32] = warpTotal;
}
__syncthreads();
if (threadIdx.x == 0)
{
int blockTotal = 0;
for (unsigned int i = 0; i < SHARED_DATA_LENGTH; i++)
{
blockTotal += sharedData[i];
}
atomicAdd(&output[0], blockTotal);
}
}
Results
As you would expect, given the number of test scenarios, the differing combinations of block/grid sizes as well as our differing memory sizes, I’ve ended up with a lot of data. I’ll try to share it and analyse it in parts where the results really stood out to me.
Results have been collected on an Nvidia RTX 5060 running on Windows 11 with CUDA 13.2.1. Below are the device capabilities for the Nvidia RTX 5060:
Device 0: Name: NVIDIA GeForce RTX 5060 WarpSize: 32 MaxThreadsPerBlock: 1024 MaxGridSize(X): 2147483647 MaxGridSize(Y): 65535 MaxGridSize(Z): 65535 MaxThreadsDim(X): 1024 MaxThreadsDim(Y): 1024 MaxThreadsDim(Z): 64 SharedMemPerBlock: 49152 SharedMemPerMultiprocessor: 102400 MaxThreadsPerMultiProcessor: 1536 MaxBlocksPerMultiProcessor: 24 MultiProcessorCount: 30 SMCount: 30 MinSMPartitionSize: 8 SMCoscheduledAlignment: 8 WQConcurrencyLimit: 8
512MiB x 1 Results:
Comparing tests with same block sizes:
Here we are comparing all 6 of our test scenarios with the same block and grid sizes, so each test issues exactly the same number of threads. All of these test scenarios achieve a potential occupancy of 100%.
The part that surprised me here is just how close majority of these results are in terms of performance, especially the first three results. Add_AtomicPerThread does 512 global atomic increments per block and it’s performing better than several tests which do fewer global atomic increments.Add_AtomicPerValuePerThread is the outlier here, with such a large number of atomic increments per thread, showing an 80% increase in performance compared to our fastest scenario.
| Test < ValuesSummedPerThread, BlockSize > | Average Time (ms) | Global Memory Atomic Increments Per Block |
| Add_AtomicSharedPerThread < 16, 512 > | 5.1212 | 1 (Shared Memory Atomic Increments Per Block = 512) |
| Add_AtomicPerThread < 16, 512 > | 5.1213 | 512 |
| Add_SharedUniqueEntryPerThread < 16, 512 > | 5.2586 | 1 |
| Add_SharedWithReduceAddSyncPerWarp < 16, 512 > | 5.3104 | 1 |
| Add_ReduceAddSyncPerWarp < 16, 512 > | 5.6654 | 16 (Increment per warp which is 32 threads) |
| Add_AtomicPerValuePerThread < 16, 512 > | 9.1421 | 8,192 |
How occupancy affects results:
Focusing in on the Add_AtomicPerThread scenario, it’s quite clear here how not achieving full occupancy is having a detrimental effect on the test results, to the point where the time is more than doubled when comparing 100% to 50% occupancy. This suggests that each SM is spending time idle, likely waiting for memory operations to complete and cannot spend that time processing more warps due to the tests not achieving full occupancy.
| Test < ValuesSummedPerThread, BlockSize > | Average Time (ms) | Occupancy |
| Add_AtomicPerThread < 1, 32 > | 20.0118 | 50% |
| Add_AtomicPerThread < 1, 128 > | 8.4007 | 100% |
| Add_AtomicPerThread < 1, 512 > | 8.4272 | 100% |
| Add_AtomicPerThread < 1, 1024 > | 10.7840 | 66.7% |
4MiB x 64 Results:
With the 4MiB tests, where each scenario is run 64 times back to back, we are expecting very different results to the single 512MiB tests because we expect the input memory to always remain within the GPU caches, so less time is spent waiting on memory operations. We’ll start by comparing the fastest results for each test scenario:
| Test < ValuesSummedPerThread, BlockSize > | Average Time (ms) | Global Memory Atomic Increments Per Block |
| Add_AtomicSharedPerThread < 4, 128 > | 0.8802 | 1 (Shared Memory Atomic Increments Per Block = 128) |
| Add_SharedWithReduceAddSyncPerWarp < 4, 128 > | 0.9538 | 1 |
| Add_ReduceAddSyncPerWarp < 8, 64 > | 1.0817 | 2 (Increment per warp which is 32 threads) |
| Add_SharedUniqueEntryPerThread < 4, 128 > | 1.1001 | 1 |
| Add_AtomicPerThread < 8, 512 > | 1.1047 | 512 |
| Add_AtomicPerValuePerThread < 1, 512 > | 4.1136 | 512 |
The results here align more with our initial expectations where the performance scales with the number of atomic increments per thread. But the difference is only small, comparing the fastest Add_AtomicSharedPerThread to to Add_AtomicPerThread, there are 512x more global atomic increments per thread, but performance is only 1.25x slower.
Importance of keeping the GPU saturated
The results below show the how increasing the amount of work done per-thread decreases performance. Each SM on the GPU is designed to have many blocks assigned to it, ready to execute when other threads become blocked in order to keep the GPU constantly working. If there aren’t enough threads available to the SMs, we just end up wasting performance.
| Test < ValuesSummedPerThread, BlockSize > | Average Time (ms) | Total Threads |
| Add_AtomicSharedPerThread< 4, 512> | 2.2088 | 131,072 |
| Add_AtomicSharedPerThread< 8, 512> | 2.2291 | 65,536 |
| Add_AtomicSharedPerThread< 16, 512> | 4.1010 | 32,768 |
| Add_AtomicSharedPerThread< 32, 512> | 7.9956 | 16,384 |
| Add_AtomicSharedPerThread< 64, 512> | 8.7155 | 8,192 |
| Add_AtomicSharedPerThread< 128, 512> | 10.8872 | 4,096 |
| Add_AtomicSharedPerThread< 256, 512> | 14.6373 | 2,048 |
| Add_AtomicSharedPerThread< 512, 512> | 13.9761 | 1,024 |
| Add_AtomicSharedPerThread< 1024, 512> | 26.4963 | 512 |
| Add_AtomicSharedPerThread< 2048, 512> | 51.9823 | 256 |
| Add_AtomicSharedPerThread< 4096, 512> | 102.4490 | 128 |
| Add_AtomicSharedPerThread< 8192, 512> | 199.3275 | 64 |
| Add_AtomicSharedPerThread< 32768, 128> | 198.4756 | 16 |
Many of the other tests show similar patterns to the above as we increase the amount of work per-thread but often there is a sweet spot. Take a look at the results below, the fastest test is when we are processing 8 values per thread, anything above or below this is slower. Like with many things, it really is a delicate balancing act when you are trying to find the most optimal performance setup. We need to spawn enough threads to keep the GPU saturated, but the sweet spot here is where we can keep the GPU fully saturated but allow each thread to do as much work as possible.
| Test < ValuesSummedPerThread, BlockSize > | Average Time (ms) | Total Threads |
| Add_ReduceAddSyncPerWarp< 1, 64> | 8.4175 | 4,194,304 |
| Add_ReduceAddSyncPerWarp< 2, 64> | 4.8579 | 2,097,152 |
| Add_ReduceAddSyncPerWarp< 4, 64> | 2.9825 | 1,048,576 |
| Add_ReduceAddSyncPerWarp< 8, 64> | 2.2488 | 524,288 |
| Add_ReduceAddSyncPerWarp< 16, 64> | 3.8810 | 262,144 |
| Add_ReduceAddSyncPerWarp< 32, 64> | 7.5115 | 131,072 |
Trace Metrics
512MiB vs 4MiB:
I hypothesised earlier that the performance difference I observed between the 512MiB and 4MiB tests were due to memory bandwidth, lets see if we can prove that using hardware metrics provided by Nvidia Nsight Systems.
We’ll compare the simple Add_AtomicPerThread<8,512> test with 512MiB x 1 run and 4MiB x 128 runs:
The results show as huge difference in the throughput between the two tests, with the 512MiB test at nearly 95% throughput on VRAM while the 4MiB test is closer to 1%. L1TEX and VidL2 throughput are both also much higher compared to the 512MiB, a good indication that the data we are processing is remaining in the caches and achieving better throughput as a result.
Add_SharedWithReduceAddSyncPerWarp vs Add_AtomicPerThread:
Next, lets compare two very different functions running the 4MiB x 128 test setup and see how the trace metrics look. On average, Add_SharedWithReduceAddSyncPerWarp is faster than Add_AtomicPerThread so it’s interesting to observe that Add_AtomicPerThread has better throughput for L1TEX as well as much better overall warp occupancy. I think it’s important to think about what the throughput stats actually mean, really it’s pointing to us the units which are under the most pressure. We saw above in the 512MiB vs 4MiB test that the VRAM throughput was 95% vs 1%, and this explains the huge difference in performance there, but just because we were achieving 95% throughput in that unit, does not mean it’s faster.
Conclusions
The main objective here was a personal objective of mine to introduce myself to CUDA programming, understand how it’s written, how to profile it and performance characteristics.
I think one thing that’s shown here is how in order to improve performance of your application, you must target what the primary bottleneck is. Simply reducing global atomic operations overall may not be the biggest improvement you can make, if it isn’t your primary bottleneck. Often, one of the most important things you can do is always try to keep the SM units fully saturated with as much work as possible and spread your workload evenly across as many threads as possible.
Something else that has stood out here is that using shared memory is very fast. Often the test scenarios which are more complex, involving more code and doing atomic writes to shared memory are amongst the fastest.
Further Investigations:
- Why is the occupancy so far off 100% for the traces above? What could be done to improve this further?
- Are there other CUDA features that could improve our parallel add further? Such as using CUDA Tiles or Clusters? Or Tensor features?
- How would these results differ on different GPUs?
- Are there more trace statistics we can gather to better understand our results, or better understand our existing trace data?
Source Code is accessible here: https://github.com/comeradealexi/CUDAParallelAddBenchmarks/blob/main/kernel.cu










Leave a Reply