A grid-stride loop is a CUDA kernel pattern in which each thread processes multiple elements by stepping through the data in increments equal to the total number of threads in the grid:

for (int i = blockIdx.x * blockDim.x + threadIdx.x;
     i < n;
     i += gridDim.x * blockDim.x)
{
    // process element i
}

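The naive alternative is a one-to-one mapping in which thread i handles exactly element i, which requires at least as many threads as elements. The grid-stride pattern decouples thread count from data size: each thread handles elements i, i + stride, i + 2*stride, and so on for as long as the index stays below n, where stride = gridDim.x * blockDim.x.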
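
To make the decoupling concrete, here is a minimal sketch of a complete kernel and launch. The names saxpy and run_saxpy, the SAXPY operation itself, and the fixed launch configuration of 4 blocks of 256 threads are illustrative assumptions, not taken from the text above. Error checking is omitted for brevity, and x and y are assumed to have the same length.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: y[i] = a * x[i] + y[i], written as a grid-stride loop.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
    {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical host wrapper (assumes x.size() == y.size()).
// Return codes of the CUDA calls are ignored here to keep the sketch short.
std::vector<float> run_saxpy(float a, const std::vector<float>& x, std::vector<float> y)
{
    int n = static_cast<int>(x.size());
    size_t bytes = x.size() * sizeof(float);

    float* d_x = nullptr;
    float* d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    // Fixed grid of 4 * 256 = 1024 threads, chosen independently of n.
    // The loop inside the kernel covers any elements beyond the first 1024.
    saxpy<<<4, 256>>>(n, a, d_x, d_y);

    // Blocking copy on the default stream, so it waits for the kernel to finish.
    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

With this launch configuration the stride is 1024, so for n = 10000 thread 0 processes elements 0, 1024, 2048, and so on up to 9216, while the same kernel launched with n = 100 simply leaves most threads with nothing to do. In practice the grid size would usually be tuned to the device rather than hard-coded.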