Friday, March 30, 2012

std::sort

I tackled accelerating the STL sorting routine on the GPU this week. Similar to list.sort(), this method lends itself to performance gains through GPU acceleration.

The STL sort method operates on RandomAccessIterators. However, random access does not guarantee that the data is laid out contiguously in memory; we can test for this property and, when it holds, copy the range directly to the GPU rather than going through the iterators one element at a time.


// Fast path: when the range occupies one contiguous block of memory,
// copy it to the device in a single call.
T * start = &(*first);
T * end = &(*(last - 1)) + 1;
std::ptrdiff_t total_size = last - first;

if( ((char*)end) - ((char*)start) == total_size * sizeof(T) ){
    thrust::copy(start, end, device.begin());
}else{
    // Iterate and copy one element at a time
}


With this optimization, we can avoid copying into an intermediate structure first or calling copy multiple times. This will allow better performance for (at least) vectors, heap arrays, and stack arrays.
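The contiguity test can be wrapped in a small standalone helper. Below is a sketch using only standard containers (no Thrust); `is_contiguous` is a hypothetical name for illustration, not part of the actual implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

// Sketch: returns true when [first, last) occupies one contiguous block of
// memory, so the whole range could be copied to the GPU in a single call.
template <typename Iterator>
bool is_contiguous(Iterator first, Iterator last) {
    typedef typename std::iterator_traits<Iterator>::value_type T;
    if (first == last) return true;   // empty range: trivially contiguous
    T * start = &(*first);
    Iterator back = last;
    --back;                           // avoids requiring random access
    T * end = &(*back) + 1;
    std::ptrdiff_t n = std::distance(first, last);
    return ((char*)end) - ((char*)start) == n * (std::ptrdiff_t)sizeof(T);
}
```

A vector passes the test, while a list (whose elements live in separate nodes) fails it, which is exactly the distinction the copy fast path needs.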

With a simple sorting microbenchmark, I compared the performance of the sorting algorithm on the CPU and the GPU for the int, float, and long data types. The graph below shows the speedup obtained by the GPU sort algorithm on both vectors and lists.



There are a few important observations to draw from this data. First, the GPU implementation of the vector sorting algorithm offers a smaller speedup than the list sorting algorithm, mainly due to the CPU vector sort being highly optimized. Second, the data revealed that, with GPU acceleration, the runtime difference between vector sort and list sort is much less than when they are executed on the CPU. Thus, at a large enough data size, the choice of a data structure can depend less on the need for a fast sorting algorithm and more on the other needs of the application.

Monday, March 19, 2012

Accelerating std::find

One of the major benefits of the STL is the generic application of algorithms over iterators. This allows a single (though possibly template specialized) implementation of the algorithms provided. As an example, linear search on a vector and a list can be implemented in the same simple code:


template <typename _Iterator, typename T>
_Iterator find( _Iterator start, _Iterator end, const T & value ){
    for(; start != end; ++start){
        if(*start == value){
            return start;
        }
    }
    return end;
}


When working in the STL, there are a few constraints that the programmer must be aware of regarding how these functions will be called:
  • A method may be called once and not called again
  • The iterator input may come from many different container types
Thus, it becomes difficult to optimize code for specific cases. For example, GPU acceleration would benefit greatly if the STL methods knew whether a container had been modified between calls: on an unmodified container, the second call could skip copying the data from the host to the device. However, even if the "first" and "last" iterators point to the same memory locations on consecutive calls, there is no efficient way to determine whether the data in the container has changed.
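One way around this limitation (not something this project implements) would be to track modifications explicitly. A minimal sketch, where `TrackedVector` and its version counter are hypothetical names invented for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: a wrapper that bumps a version counter on every potentially
// mutating operation. A GPU cache layer could compare the version against
// the one recorded at the last copy and skip the host-to-device transfer
// when the data is unchanged.
template <typename T>
class TrackedVector {
public:
    TrackedVector() : version_(0) {}
    void push_back(const T & v) { data_.push_back(v); ++version_; }
    T & at(std::size_t i) { ++version_; return data_.at(i); } // may mutate
    const T & get(std::size_t i) const { return data_.at(i); } // read-only
    std::size_t version() const { return version_; }
private:
    std::vector<T> data_;
    std::size_t version_;
};
```

The cost is that every container in the program must go through the wrapper, which is exactly the kind of interface change the STL's constraints rule out.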

In the case of std::find, the constraints placed on the STL methods make it difficult to accelerate execution using a GPU. Using linear search, the implementation of std::find has a runtime of O(n). With an iterator, the runtime is O(n) just to copy the data to the GPU. In some cases, such as with a vector or static-sized array, we can identify that the underlying memory is contiguous, and therefore do a direct copy to the GPU without using iterator operations. However, even this is not fast enough on my development system. With "-O3" CPU optimizations, the GPU version of std::find is never more efficient than the CPU version for reasonable data sizes. Thus, I will have to save further performance tuning of std::find and other O(n) STL methods for a system with a newer GPU.

However, GPU acceleration of methods like this may not be entirely impossible. As suggested in my project proposal, it may be possible to mimic Intel's Array Building Blocks and "queue" operations over containers until the result is actually used. Thus, we may be able to combine multiple copies from the CPU to GPU into a single copy and consecutively execute multiple algorithms over the same data. Using this method, we may be able to execute O(n) methods like std::find on the GPU for a performance gain. At this time, I'll leave this possibility to future work and focus on some of the more algorithmically complex methods.
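As a rough CPU-only illustration of that queueing idea (all names here, such as `LazyVector`, `queue`, and `flush`, are hypothetical and invented for this sketch):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Sketch: defer operations on a container and run them all at once when the
// result is first needed. On a real GPU backend, flush() would perform a
// single host-to-device copy, run every queued kernel in sequence, and copy
// the data back once.
class LazyVector {
public:
    explicit LazyVector(const std::vector<int> & d) : data_(d) {}
    void queue(std::function<void(std::vector<int>&)> op) { ops_.push_back(op); }
    const std::vector<int> & result() { flush(); return data_; }
private:
    void flush() {
        for (std::size_t i = 0; i < ops_.size(); ++i) ops_[i](data_);
        ops_.clear();
    }
    std::vector<int> data_;
    std::vector<std::function<void(std::vector<int>&)> > ops_;
};
```

With this structure, two queued algorithms over the same data pay for one round trip instead of two, which is what makes O(n) methods like std::find potentially worthwhile on the GPU.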

Wednesday, March 14, 2012

Odd nvcc Error

While working on the STL, I encountered this error from the nvcc compiler:

kernel launches from templates are not allowed in system files

This was caused by including the Thrust headers after the line:

#pragma GCC system_header

I'm not sure why this restriction exists, and I couldn't find any documentation that explains it. I'll keep searching for an explanation and update this post if I find one.
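For reference, the arrangement that avoids the error is to pull in the Thrust headers before the pragma takes effect. This fragment is only a sketch of the include ordering (it requires nvcc and Thrust, so it is not meant to compile on its own, and the header name is hypothetical):

```cpp
// stl_wrapper.h -- hypothetical header shown only to illustrate ordering
#include <thrust/sort.h>   // include Thrust first, outside the system-header region

#pragma GCC system_header  // everything below this line is treated as a system file
// ... STL implementation code that may trigger Thrust-backed kernel launches ...
```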

Compiler Compatibility Woes

For this project, I had planned to use libcxx as a base implementation of the STL. Unfortunately, that plan failed due to compatibility issues when mixing libcxx, Thrust, and Linux. I'm developing on a server with the following configuration:

  • openSUSE Linux
  • Cuda Version 4.0
  • gcc 4.5
  • GeForce 9600 GT

Unfortunately, the stream portions of the STL have only been implemented for Mac OS X in libcxx. By itself, this isn't a problem because I could just use the old C libraries for output in any benchmarks. However, Thrust uses the IO portions of the STL, which makes it impossible to compile any benchmarks using libcxx on my target platform. Thus, I will be testing my implementation using the version of libstdc++ that ships with gcc 4.5.

Monday, March 12, 2012

Project Overview

Welcome to my project blog for CIS 565: GPU Programming and Architecture at the University of Pennsylvania. For this project, my goal is to add automatic GPU acceleration to the C++ STL.

The C++ Standard Template Library provides containers and algorithms for use in C++ programs. As a product of the sequential era of computing, these algorithms do not take advantage of advances in parallel computation, such as an available GPU. At large enough input sizes, the GPU will outperform the CPU for many of the algorithms, such as sorting and searching, that are implemented in the C++ STL.

But wait, C++ programmers can just use Thrust when they want to execute on the GPU, so why should we bother with this project? Well, by baking GPU execution into the STL, C++ programmers won't be forced to learn a new library to accelerate their programs, and programs that have already been written to use the STL can simply be recompiled with a new compiler flag to enable GPU execution. Plus, using a library like Thrust generally requires that the programmer analyze the performance of the algorithms to determine when to execute on the GPU and when to execute on the CPU. Instead, we can determine general heuristics that will allow us to automatically choose the correct execution environment. Thus, we can accelerate performance in the common case and leave direct use of a library like Thrust to uncommon situations.
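The kind of heuristic described above can start as something as simple as a size threshold. A sketch of the dispatch structure (CPU-only: both branches fall back to std::sort, since the GPU path would call thrust::sort on real hardware; `GPU_THRESHOLD` and `accelerated_sort` are hypothetical names, and the threshold value would have to come from benchmarking):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical size threshold below which the host-to-device copy would cost
// more than the GPU saves. The real value must be measured per platform.
static const std::size_t GPU_THRESHOLD = 100000;

// Sketch of heuristic dispatch: small inputs stay on the CPU, large inputs
// would go to the GPU. Here both paths use std::sort as a stand-in.
void accelerated_sort(std::vector<int> & v) {
    if (v.size() < GPU_THRESHOLD) {
        std::sort(v.begin(), v.end());   // CPU path
    } else {
        std::sort(v.begin(), v.end());   // stand-in for the GPU (thrust::sort) path
    }
}
```

Because the dispatch lives inside the STL method, the calling code never changes; only the threshold constant needs tuning per platform.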

Using Thrust, I plan to build GPU acceleration into the algorithms provided by the C++ STL, including methods associated with the containers vector, list, and map. As an example, I've implemented GPU acceleration for sort on std::list. The Thrust library only implements algorithms over vectors, but it is possible to first flatten a list to a vector, sort on the GPU, and then un-flatten the vector back into a list.
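The flatten/sort/un-flatten round trip can be sketched as follows. This is a CPU-only stand-in (std::sort takes the place of thrust::sort, and `sort_list_via_vector` is a name invented for the sketch):

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Sketch: flatten a list into a vector, sort the vector (thrust::sort on the
// device in the real implementation; std::sort stands in here), then copy
// the sorted values back into the existing list nodes.
template <typename T>
void sort_list_via_vector(std::list<T> & l) {
    std::vector<T> flat(l.begin(), l.end());        // flatten: list -> vector
    std::sort(flat.begin(), flat.end());            // stand-in for thrust::sort
    std::copy(flat.begin(), flat.end(), l.begin()); // un-flatten: vector -> list
}
```

The two O(n) copies are the price of using a list; as the benchmarks below show, the GPU sort still wins once the list is large enough.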



With no CPU optimizations on a simple microbenchmark that sorts a large random list, the GPU sort begins to dominate at a list size of 128,000 integers, as shown in the graph below.



With optimizations, the difference becomes even greater, as the copy to and from the list no longer dominates the runtime of the GPU accelerated algorithm. For an array of 16,000,000 integers, the CPU program takes approximately 19 minutes to execute while the GPU version finishes in 12 seconds, a speedup of roughly 95x. A C++ programmer can achieve this performance by simply adding a flag when compiling!

For this project, I plan to accelerate the methods in the following headers (some methods may be added to the interfaces):

algorithm: for_each, find, count, equal, search, sort, min_element, max_element, set_union, set_intersection, set_difference, set_symmetric_difference
vector: sort, find
list: sort

As part of the project, I will analyze the performance of these methods on multiple GPUs and investigate runtime heuristics for choosing the correct execution environment.

Lastly, here are a few links that relate to this project:

C++ STL Reference

Thrust Reference

Project Proposal (PDF)

Github Repository