Tuesday, April 24, 2012

Final Project Post

This is the last update to this project for the requirements for CIS 565:  GPU Programming and Architecture.  Here are my files for submission:


Final Paper
Final Presentation
Github Repository

Monday, April 16, 2012


Instead of accelerating std::for_each, which does not allow a return value, I changed plans and accelerated std::transform instead. This function provides a great experimental platform for testing the heuristics necessary for accelerating STL computations using a GPU because the transform function accepts a nearly arbitrary function.

Taking advantage of this capability, I created a functor that executes M multiplications and used this functor to measure how the CPU and GPU versions of transform scale with the number of multiplications in the functor's operator. In the graph below, the speedup for GPU execution is shown for 10 to 100 multiplications (using a vector of 8 million floats).

From this experiment, it became clear that the runtime for the GPU version is tied to the size of the vector rather than the number of instructions in the functor. On the other hand, the runtime for the GPU version corresponds directly to the number of instructions in the functor. Comparing 10 instructions to 100 instructions on the GPU, there is only an 1.08x difference in runtime. On the other hand, there is a 72x difference in runtime on the CPU for 10 to 100 instructions.

At this point, I plan to work on heuristics for selecting CPU or GPU execution. I'll also gather performance numbers for each experiment on different GPUs.

Sunday, April 15, 2012

STL Set Operations

STL set algorithms such as set_union and set_intersection pose a potential memory safety issue under GPU acceleration. These operations accept parameters for the start and end of both input sets and the start of the output set. In the CPU version, the output iterator is simply incremented each time a new member is added to the output set. Thus, the algorithm does not need to know the size of the output container (though this implementation could possibly cause a memory error to occur.

In the GPU version of these algorithms, the output set device structure and the final copy from the GPU output to the CPU output must be sized to cover the maximum possible output size. However, there is no guarantee that the programmer has sized the CPU output container to fit the maximum output size. For example, the programmer might know that the input sets to an union computation share many common members and use a smaller output container than theoretically necessary to reflect this knowledge. If the GPU version of set_union then copies the maximum possible number of elements to the output container, an out-of-bounds memory write can occur that would not have occurred in the CPU version. Therefore, the specification for a GPU set_union would have to require that the output container can fit the maximum number of output elements. In the case that the output container is not sized properly, the behavior would be undefined, as is true in many cases in the C++ specification.

In my implementation of GPU accelerated set operations, I assume that the programmer will have properly sized the output containers for the maximum number of output elements.

Back from the World of Writing Papers

I've been absent from my project blog for the past week because I've been working on a submission to OOPSLA 2012. Now that the paper has been submitted, I can get back to working on this project.

As of today, I've completed GPU accelerated versions of the following algorithms: sort, find, min_element, max_element, set_union, set_intersection, set_difference, and set_symmetric_difference. I plan to still implement foreach, count, and equal. Once finished, I'll continue to explore runtime heuristics, analyze the performance of each algorithm, and attempt to optimize the GPU performance.

I've also gained ssh access to a server with a better GPU than the one I've been using for testing. I'll be able to get performance results and have a comparison of at least the two GPUs. I may also try to analyze a few more GPUs in the graphics lab, but this is contingent on being able to use the previously developed STL code with those machines because the current implementation is gcc version specific.

Sunday, April 1, 2012

Friday, March 30, 2012


I tackled accelerating the STL sorting routine on the GPU this week. Similar to list.sort(), this method lends itself to performance gains through GPU acceleration.

The STL sort method operates on RandomAccessIterators. However, this does not necessarily mean that the data is laid out sequentially in memory, but we can test for this property to allow a direct copy to the GPU rather than using iterators.

T * start = &(*first);
T * end = &(*(last-1)) + 1;

if(((char*)end) - ((char*)start) == total_size * (sizeof(T))){


// Iterator and copy one at a time

With this optimization, we can avoid copying into an intermediate structure first or calling copy multiple times. This will allow better performance for (at least) vectors, heap arrays, and stack arrays.

With a simple sorting microbenchmark, I compared the performance of the sorting algorithm on the CPU and the GPU for the int, float, and long data types. The graph below shows the speedup obtained by the GPU sort algorithm on both vectors and lists.

There are a few important observations to draw from this data. First, the GPU implementation of the vector sorting algorithm offers a smaller speedup than the list sorting algorithm, mainly due to the CPU vector sort being highly optimized. Second, the data revealed that, with GPU acceleration, the runtime difference between vector sort and list sort is much less than when they are executed on the CPU. Thus, at a large enough data size, the choice of a data structure can depend less on the need for a fast sorting algorithm and more on the other needs of the application.

Monday, March 19, 2012

Accelerating std::find

One of the major benefits of the STL is the generic application of algorithms over iterators. This allows a single (though possibly template specialized) implementation of the algorithms provided. As an example, linear search on a vector and a list can be implemented in the same simple code:

template <typename _Iterator, typename T>
_Iterator find( _Iterator start, _Iterator end, T & value ){
for(; start != end; ++start){
if(*start == value){
return start;
return end;

When working in the STL, there are a few constraints that the programmer must be aware of regarding how these functions will be called:
  • A method may be called once and not called again
  • The iterator input may come from many different container types
Thus, it becomes difficult to optimize code for specific cases. As an example, it would be beneficial to performance when applying GPU acceleration if the STL methods knew whether or not a container had been modified from one call to the next because the method could avoid copying the data from the host to the device on the second call. However, even if the "first" and "last" iterators point to the same memory locations on consecutive calls, it is impossible to efficiently determine whether or not the data in the container has been modified.

In the case of std::find, the constraints place on the STL methods make it difficult to accelerate execution using a GPU. Using linear search, the implementation of std::find has a runtime of O(n). With an iterator, the runtime is O(n) just to copy the data to the GPU. In some cases, such as with a vector or static-sized array, we can identify that the underlying memory is continuous, and therefore do a direct copy to the GPU without using iterator operations. However, this does not provide efficient enough performance on my development system. With "-O3" CPU optimizations, the GPU version of std::find is never more efficient then the CPU version for reasonable data sizes. Thus, I will have to save further performance tuning of std::find and other O(n) STL methods for a system with a newer GPU.

However, GPU acceleration of methods like this may not be entirely impossible. As suggested in my project proposal, it may be possible to mimic Intel's Array Building Blocks and "queue" operations over containers until the result is actually used. Thus, we may be able to combine multiple copies from the CPU to GPU into a single copy and consecutively execute multiple algorithms over the same data. Using this method, we may be able to execute O(n) methods like std::find on the GPU for a performance gain. At this time, I'll leave this possibility to future work and focus on some of the more algorithmically complex methods.