libcxxgpu: April 2012

Tuesday, April 24, 2012

Final Project Post

This is the last update to this project for the requirements for CIS 565: GPU Programming and Architecture. Here are my files for submission:

Final Paper
Final Presentation
Github Repository

Monday, April 16, 2012

Instead of accelerating std::for_each, which does not allow a return value, I changed plans and accelerated std::transform instead. This function provides a great experimental platform for testing the heuristics necessary for accelerating STL computations using a GPU because the transform function accepts a nearly arbitrary function.

Taking advantage of this capability, I created a functor that executes M multiplications and used this functor to measure how the CPU and GPU versions of transform scale with the number of multiplications in the functor's operator. In the graph below, the speedup for GPU execution is shown for 10 to 100 multiplications (using a vector of 8 million floats).

From this experiment, it became clear that the runtime for the GPU version is tied to the size of the vector rather than the number of instructions in the functor. On the other hand, the runtime for the GPU version corresponds directly to the number of instructions in the functor. Comparing 10 instructions to 100 instructions on the GPU, there is only an 1.08x difference in runtime. On the other hand, there is a 72x difference in runtime on the CPU for 10 to 100 instructions.

At this point, I plan to work on heuristics for selecting CPU or GPU execution. I'll also gather performance numbers for each experiment on different GPUs.

Sunday, April 15, 2012

STL Set Operations

STL set algorithms such as set_union and set_intersection pose a potential memory safety issue under GPU acceleration. These operations accept parameters for the start and end of both input sets and the start of the output set. In the CPU version, the output iterator is simply incremented each time a new member is added to the output set. Thus, the algorithm does not need to know the size of the output container (though this implementation could possibly cause a memory error to occur.

In the GPU version of these algorithms, the output set device structure and the final copy from the GPU output to the CPU output must be sized to cover the maximum possible output size. However, there is no guarantee that the programmer has sized the CPU output container to fit the maximum output size. For example, the programmer might know that the input sets to an union computation share many common members and use a smaller output container than theoretically necessary to reflect this knowledge. If the GPU version of set_union then copies the maximum possible number of elements to the output container, an out-of-bounds memory write can occur that would not have occurred in the CPU version. Therefore, the specification for a GPU set_union would have to require that the output container can fit the maximum number of output elements. In the case that the output container is not sized properly, the behavior would be undefined, as is true in many cases in the C++ specification.

In my implementation of GPU accelerated set operations, I assume that the programmer will have properly sized the output containers for the maximum number of output elements.

Back from the World of Writing Papers

I've been absent from my project blog for the past week because I've been working on a submission to OOPSLA 2012. Now that the paper has been submitted, I can get back to working on this project.

As of today, I've completed GPU accelerated versions of the following algorithms: sort, find, min_element, max_element, set_union, set_intersection, set_difference, and set_symmetric_difference. I plan to still implement foreach, count, and equal. Once finished, I'll continue to explore runtime heuristics, analyze the performance of each algorithm, and attempt to optimize the GPU performance.

I've also gained ssh access to a server with a better GPU than the one I've been using for testing. I'll be able to get performance results and have a comparison of at least the two GPUs. I may also try to analyze a few more GPUs in the graphics lab, but this is contingent on being able to use the previously developed STL code with those machines because the current implementation is gcc version specific.

Sunday, April 1, 2012

Midpoint Slides

Here are my slides for the midpoint project presentation.

Slides