Tuesday, April 24, 2012

Final Project Post

This is the last update to this project for the requirements for CIS 565:  GPU Programming and Architecture.  Here are my files for submission:


Final Paper
Final Presentation
Github Repository

Monday, April 16, 2012

std::transform

Instead of accelerating std::for_each, which does not allow a return value, I changed plans and accelerated std::transform instead. This function provides a great experimental platform for testing the heuristics necessary for accelerating STL computations using a GPU because the transform function accepts a nearly arbitrary function.

Taking advantage of this capability, I created a functor that executes M multiplications and used this functor to measure how the CPU and GPU versions of transform scale with the number of multiplications in the functor's operator. In the graph below, the speedup for GPU execution is shown for 10 to 100 multiplications (using a vector of 8 million floats).



From this experiment, it became clear that the runtime of the GPU version is tied to the size of the vector rather than to the number of instructions in the functor, while the runtime of the CPU version corresponds directly to the number of instructions in the functor. Comparing 10 instructions to 100 instructions on the GPU, there is only a 1.08x difference in runtime; the same comparison on the CPU yields a 72x difference.

At this point, I plan to work on heuristics for selecting CPU or GPU execution. I'll also gather performance numbers for each experiment on different GPUs.

Sunday, April 15, 2012

STL Set Operations

STL set algorithms such as set_union and set_intersection pose a potential memory safety issue under GPU acceleration. These operations accept parameters for the start and end of both input sets and the start of the output set. In the CPU version, the output iterator is simply incremented each time a new member is added to the output set. Thus, the algorithm does not need to know the size of the output container (though this implementation can still cause a memory error if the container is too small).

In the GPU version of these algorithms, the output set's device structure and the final copy from the GPU output to the CPU output must be sized to cover the maximum possible output size. However, there is no guarantee that the programmer has sized the CPU output container to fit the maximum output size. For example, the programmer might know that the input sets to a union computation share many common members and use a smaller output container than is theoretically necessary. If the GPU version of set_union then copies the maximum possible number of elements to the output container, an out-of-bounds memory write can occur that would not have occurred in the CPU version. Therefore, the specification for a GPU set_union would have to require that the output container fit the maximum number of output elements. If the output container is not sized properly, the behavior would be undefined, as is true in many cases in the C++ specification.

In my implementation of GPU accelerated set operations, I assume that the programmer will have properly sized the output containers for the maximum number of output elements.

Back from the World of Writing Papers

I've been absent from my project blog for the past week because I've been working on a submission to OOPSLA 2012. Now that the paper has been submitted, I can get back to working on this project.

As of today, I've completed GPU accelerated versions of the following algorithms: sort, find, min_element, max_element, set_union, set_intersection, set_difference, and set_symmetric_difference. I still plan to implement for_each, count, and equal. Once finished, I'll continue to explore runtime heuristics, analyze the performance of each algorithm, and attempt to optimize the GPU performance.

I've also gained ssh access to a server with a better GPU than the one I've been using for testing, so I'll be able to gather performance results and compare at least these two GPUs. I may also try to analyze a few more GPUs in the graphics lab, but this is contingent on being able to use the previously developed STL code on those machines, because the current implementation is gcc-version specific.

Friday, March 30, 2012

std::sort

I tackled accelerating the STL sorting routine on the GPU this week. Similar to list.sort(), this method lends itself to performance gains through GPU acceleration.

The STL sort method operates on RandomAccessIterators. Random access alone does not guarantee that the data is laid out contiguously in memory, but we can test for this property and, when it holds, copy directly to the GPU rather than copying element by element through the iterators.


T * start = &(*first);
T * end = &(*(last - 1)) + 1;
const std::ptrdiff_t total_size = last - first;

// If the distance in bytes matches the element count, the storage is
// contiguous and a single bulk copy to the device is safe.
if (((char*)end) - ((char*)start) == total_size * sizeof(T)) {
    thrust::copy(start, end, device.begin());
} else {
    // Non-contiguous storage: iterate and copy one element at a time
}


With this optimization, we can avoid copying into an intermediate structure first or calling copy multiple times. This will allow better performance for (at least) vectors, heap arrays, and stack arrays.

With a simple sorting microbenchmark, I compared the performance of the sorting algorithm on the CPU and the GPU for the int, float, and long data types. The graph below shows the speedup obtained by the GPU sort algorithm on both vectors and lists.



There are a few important observations to draw from this data. First, the GPU implementation of the vector sorting algorithm offers a smaller speedup than the list sorting algorithm, mainly due to the CPU vector sort being highly optimized. Second, the data revealed that, with GPU acceleration, the runtime difference between vector sort and list sort is much less than when they are executed on the CPU. Thus, at a large enough data size, the choice of a data structure can depend less on the need for a fast sorting algorithm and more on the other needs of the application.

Monday, March 19, 2012

Accelerating std::find

One of the major benefits of the STL is the generic application of algorithms over iterators. This allows a single (though possibly template specialized) implementation of the algorithms provided. As an example, linear search on a vector and a list can be implemented in the same simple code:


template <typename _Iterator, typename T>
_Iterator find( _Iterator start, _Iterator end, const T & value ){
    for(; start != end; ++start){
        if(*start == value){
            return start;
        }
    }
    return end;
}


When working in the STL, there are a few constraints that the programmer must be aware of regarding how these functions will be called:
  • A method may be called once and not called again
  • The iterator input may come from many different container types
Thus, it becomes difficult to optimize code for specific cases. As an example, it would be beneficial to performance when applying GPU acceleration if the STL methods knew whether or not a container had been modified from one call to the next because the method could avoid copying the data from the host to the device on the second call. However, even if the "first" and "last" iterators point to the same memory locations on consecutive calls, it is impossible to efficiently determine whether or not the data in the container has been modified.

In the case of std::find, the constraints placed on the STL methods make it difficult to accelerate execution using a GPU. Using linear search, the implementation of std::find has a runtime of O(n). With an iterator, the runtime is O(n) just to copy the data to the GPU. In some cases, such as with a vector or static-sized array, we can identify that the underlying memory is contiguous, and therefore do a direct copy to the GPU without using iterator operations. However, even this is not fast enough on my development system. With "-O3" CPU optimizations, the GPU version of std::find is never more efficient than the CPU version for reasonable data sizes. Thus, I will have to save further performance tuning of std::find and other O(n) STL methods for a system with a newer GPU.

However, GPU acceleration of methods like this may not be entirely impossible. As suggested in my project proposal, it may be possible to mimic Intel's Array Building Blocks and "queue" operations over containers until the result is actually used. Thus, we may be able to combine multiple copies from the CPU to GPU into a single copy and consecutively execute multiple algorithms over the same data. Using this method, we may be able to execute O(n) methods like std::find on the GPU for a performance gain. At this time, I'll leave this possibility to future work and focus on some of the more algorithmically complex methods.

Wednesday, March 14, 2012

Odd nvcc Error

While working on the STL, I encountered this error from the nvcc compiler:

kernel launches from templates are not allowed in system files

This was caused by including the Thrust headers after the line:

#pragma GCC system_header

I'm not sure why this restriction exists, and I couldn't find any documentation that explains it. I'll keep searching for an explanation and update this post if I find one.
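The workaround that fixed it for me was simply reordering the includes so the Thrust headers come before the pragma (this is an observation from my setup, not documented nvcc behavior):

```cpp
// Including Thrust before the system_header pragma avoids the
// "kernel launches from templates are not allowed in system files"
// error on my configuration (nvcc from CUDA 4.0 with gcc 4.5).
#include <thrust/sort.h>   // Thrust headers first

#pragma GCC system_header  // then mark the remainder as a system header
```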

Compiler Compatibility Woes

For this project, I had planned to use libcxx as a base implementation of the STL. Unfortunately, that plan failed due to compatibility issues when mixing libcxx, Thrust, and Linux. I'm developing on a server with the following configuration:

  • openSUSE Linux
  • Cuda Version 4.0
  • gcc 4.5
  • GeForce 9600 GT

Unfortunately, the stream portions of the STL have only been implemented for Mac OS X in libcxx. By itself, this isn't a problem because I could just use the old C libraries for output in any benchmarks. However, Thrust uses the IO portions of the STL, which makes it impossible to compile any benchmarks that use libcxx on my target platform. Thus, I will be testing my implementation using the version of libstdc++ that ships with gcc 4.5.

Monday, March 12, 2012

Project Overview

Welcome to my project blog for CIS 565: GPU Programming and Architecture at the University of Pennsylvania. For this project, my goal is to add automatic GPU acceleration to the C++ STL.

The C++ Standard Template Library provides containers and algorithms for use in C++ programs. As a product of the sequential era of computing, these algorithms do not take advantage of advances in parallel computation, such as an available GPU. At large enough input sizes, the GPU will outperform the CPU for many of the algorithms, such as sorting and searching, that are implemented in the C++ STL.

But wait, C++ programmers can just use Thrust when they want to execute on the GPU, so why should we bother with this project? Well, by baking GPU execution into the STL, C++ programmers won't be forced to learn a new library to accelerate their programs, and programs that have already been written to use the STL can simply be recompiled with a new compiler flag to enable GPU execution. Plus, using a library like Thrust generally requires that the programmer analyze the performance of the algorithms to determine when to execute on the GPU and when to execute on the CPU. Instead, we can determine general heuristics that will allow us to automatically choose the correct execution environment. Thus, we can accelerate performance in the common case and leave direct use of a library like Thrust to uncommon situations.

Using Thrust, I plan to build GPU acceleration into the algorithms provided by the C++ STL, including methods associated with the containers vector, list, and map. As an example, I've implemented GPU acceleration for sort on std::list. The Thrust library only implements algorithms over vectors, but it is possible to first flatten a list to a vector, sort on the GPU, and then un-flatten the vector back into a list.



With no CPU optimizations on a simple microbenchmark that sorts a large random list, the GPU sort begins to dominate at a list size of 128,000 integers, as shown in the graph below.



With optimizations, the difference becomes even greater, as the copy to and from the list no longer dominates the runtime of the GPU accelerated algorithm. For a list of 16,000,000 integers, the CPU program takes approximately 19 minutes to execute while the GPU version finishes in 12 seconds. This is a speedup of approximately 95x, and a C++ programmer can achieve this performance by simply adding a flag when compiling!

For this project, I plan to accelerate the methods in the following headers (some methods may be added to the interfaces):

algorithm: for_each, find, count, equal, search, sort, min_element, max_element, set_union, set_intersection, set_difference, set_symmetric_difference
vector: sort, find
list: sort

As part of the project, I will analyze the performance of these methods on multiple GPUs and investigate runtime heuristics for choosing the correct execution environment.

Lastly, here are a few links that relate to this project:

C++ STL Reference

Thrust Reference

Project Proposal (PDF)

Github Repository