Friday, March 30, 2012

std::sort

This week I tackled accelerating the STL sorting routine, std::sort, on the GPU. Like list.sort(), it is a good candidate for GPU acceleration.

The STL sort method operates on RandomAccessIterators. Random access does not guarantee that the data is laid out contiguously in memory, but we can test for that property at runtime; when it holds, the data can be copied to the GPU directly rather than one element at a time through the iterators.


// first, last : the RandomAccessIterators passed to sort
// total_size  : the number of elements in the range, i.e. last - first
// device      : a thrust::device_vector<T> with total_size elements
T * start = &(*first);
T * end   = &(*(last - 1)) + 1;

if (((char*)end) - ((char*)start) == total_size * sizeof(T)) {
    // Contiguous storage: copy the entire range to the GPU in one call.
    thrust::copy(start, end, device.begin());
} else {
    // Non-contiguous storage: iterate and copy one element at a time.
    thrust::copy(first, last, device.begin());
}


With this optimization, we avoid copying into an intermediate structure first or calling copy once per element. This allows better performance for (at least) vectors, heap-allocated arrays, and stack arrays.
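Putting the pieces together, a GPU-backed sort along these lines might look like the following minimal sketch. It assumes Thrust is available; the wrapper name gpu_sort and its exact structure are illustrative, not the actual implementation.

#include <cstddef>
#include <iterator>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Illustrative wrapper: copy to the device, sort there, copy back.
template <typename RandomAccessIterator>
void gpu_sort(RandomAccessIterator first, RandomAccessIterator last)
{
    typedef typename std::iterator_traits<RandomAccessIterator>::value_type T;

    const std::ptrdiff_t total_size = last - first;
    if (total_size <= 1)
        return;

    thrust::device_vector<T> device(total_size);

    // Test for contiguous storage so the whole range can be copied at once.
    T * start = &(*first);
    T * end   = &(*(last - 1)) + 1;
    if (((char*)end) - ((char*)start) == total_size * (std::ptrdiff_t)sizeof(T))
        thrust::copy(start, end, device.begin());
    else
        thrust::copy(first, last, device.begin());

    // Sort on the GPU and copy the result back into the original range.
    thrust::sort(device.begin(), device.end());
    thrust::copy(device.begin(), device.end(), first);
}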

With a simple sorting microbenchmark, I compared the performance of the sorting algorithm on the CPU and the GPU for the int, float, and long data types. The graph below shows the speedup obtained by the GPU sort algorithm on both vectors and lists.
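Before looking at the results, here is a minimal sketch of what such a timing comparison could look like for int on a vector. It assumes Thrust is available; the data size, random input, and use of std::chrono are illustrative and not necessarily the harness used for the measurements below.

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    const std::size_t n = 1 << 24;   // illustrative problem size
    std::vector<int> host(n);
    for (std::size_t i = 0; i < n; ++i)
        host[i] = std::rand();

    // CPU: std::sort on a copy of the data.
    std::vector<int> cpu_data(host);
    auto t0 = std::chrono::high_resolution_clock::now();
    std::sort(cpu_data.begin(), cpu_data.end());
    auto t1 = std::chrono::high_resolution_clock::now();

    // GPU: thrust::sort on a device copy of the same data.
    thrust::device_vector<int> gpu_data(host.begin(), host.end());
    auto t2 = std::chrono::high_resolution_clock::now();
    thrust::sort(gpu_data.begin(), gpu_data.end());
    auto t3 = std::chrono::high_resolution_clock::now();

    std::cout << "CPU sort: " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "GPU sort: " << std::chrono::duration<double>(t3 - t2).count() << " s\n";
    return 0;
}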

[Graph: speedup of the GPU sort over the CPU sort for vector and list, across int, float, and long.]
There are a few important observations to draw from this data. First, the GPU implementation offers a smaller speedup for vector sorting than for list sorting, mainly because the CPU vector sort is already highly optimized. Second, with GPU acceleration the runtime gap between vector sort and list sort is much smaller than it is on the CPU. Thus, at a large enough data size, the choice of data structure can depend less on the need for a fast sorting algorithm and more on the other needs of the application.
