
Sunday, April 15, 2012

Back from the World of Writing Papers

I've been absent from my project blog for the past week because I've been working on a submission to OOPSLA 2012. Now that the paper has been submitted, I can get back to working on this project.

As of today, I've completed GPU-accelerated versions of the following algorithms: sort, find, min_element, max_element, set_union, set_intersection, set_difference, and set_symmetric_difference. I still plan to implement for_each, count, and equal. Once those are finished, I'll continue to explore runtime heuristics, analyze the performance of each algorithm, and attempt to optimize the GPU performance.

I've also gained SSH access to a server with a better GPU than the one I've been using for testing, so I'll be able to gather performance results and compare at least those two GPUs. I may also try to analyze a few more GPUs in the graphics lab, but this is contingent on being able to use the previously developed STL code on those machines, since the current implementation is specific to a particular gcc version.

Friday, March 30, 2012

std::sort

I tackled accelerating the STL sorting routine on the GPU this week. Similar to list.sort(), this method lends itself to performance gains through GPU acceleration.

The STL sort method operates on RandomAccessIterators. Random access does not necessarily mean the data is laid out contiguously in memory, but we can test for that property and, when it holds, copy the data directly to the GPU rather than going through the iterators one element at a time.


// Raw pointers to the first element and one past the last
T * start = &(*first);
T * end = &(*(last - 1)) + 1;

// Number of elements in the range
std::ptrdiff_t total_size = last - first;

// If the pointer span matches the element count, the storage is
// contiguous, and a single bulk copy to the GPU is possible
if(((char*)end) - ((char*)start) == total_size * sizeof(T)){
    thrust::copy(start, end, device.begin());
}else{
    // Non-contiguous storage: copy one element at a time through the iterators
    thrust::copy(first, last, device.begin());
}


With this optimization, we can avoid copying into an intermediate structure first or calling copy multiple times. This will allow better performance for (at least) vectors, heap arrays, and stack arrays.
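Putting these pieces together, here is a minimal sketch of the contiguous fast path, assuming the data is laid out sequentially in memory (gpu_sort is a hypothetical name for illustration, not the actual entry point in my implementation):


#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// Hypothetical sketch: sort a contiguous range of T on the GPU
template <typename T>
void gpu_sort(T * start, T * end){
    thrust::device_vector<T> device(end - start);
    thrust::copy(start, end, device.begin());          // one bulk host-to-device copy
    thrust::sort(device.begin(), device.end());        // sort on the GPU
    thrust::copy(device.begin(), device.end(), start); // copy the result back
}


For a std::vector<int> v, this amounts to calling gpu_sort(&v[0], &v[0] + v.size()).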

With a simple sorting microbenchmark, I compared the performance of the sorting algorithm on the CPU and the GPU for the int, float, and long data types. The graph below shows the speedup obtained by the GPU sort algorithm on both vectors and lists.



There are a few important observations to draw from this data. First, the GPU implementation of vector sort offers a smaller speedup than that of list sort, mainly because the CPU's vector sort is already highly optimized. Second, with GPU acceleration, the runtime difference between vector sort and list sort is much smaller than when both are executed on the CPU. Thus, at a large enough data size, the choice of data structure can depend less on the need for a fast sorting algorithm and more on the other needs of the application.

Monday, March 12, 2012

Project Overview

Welcome to my project blog for CIS 565: GPU Programming and Architecture at the University of Pennsylvania. For this project, my goal is to add automatic GPU acceleration to the C++ STL.

The C++ Standard Template Library provides containers and algorithms for use in C++ programs. As a product of the sequential era of computing, these algorithms do not take advantage of parallel hardware, such as an available GPU. Yet at large enough input sizes, the GPU will outperform the CPU for many of the algorithms implemented in the C++ STL, such as sorting and searching.

But wait, C++ programmers can just use Thrust when they want to execute on the GPU, so why should we bother with this project? Well, by baking GPU execution into the STL, C++ programmers won't be forced to learn a new library to accelerate their programs, and programs that have already been written to use the STL can simply be recompiled with a new compiler flag to enable GPU execution. Plus, using a library like Thrust generally requires that the programmer analyze the performance of the algorithms to determine when to execute on the GPU and when to execute on the CPU. Instead, we can determine general heuristics that will allow us to automatically choose the correct execution environment. Thus, we can accelerate performance in the common case and leave direct use of a library like Thrust to uncommon situations.
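To make that concrete, here is a minimal sketch of what a size-threshold heuristic could look like; accelerated_sort is a hypothetical name, and the threshold value is illustrative rather than measured:


#include <algorithm>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// Hypothetical heuristic: small inputs stay on the CPU,
// large inputs are sorted on the GPU
template <typename T>
void accelerated_sort(std::vector<T> & v){
    const std::size_t GPU_THRESHOLD = 128000; // illustrative crossover point
    if(v.size() < GPU_THRESHOLD){
        std::sort(v.begin(), v.end());
    }else{
        thrust::device_vector<T> device(v.begin(), v.end());
        thrust::sort(device.begin(), device.end());
        thrust::copy(device.begin(), device.end(), v.begin());
    }
}


In practice, the crossover point would come from the runtime heuristics this project aims to develop, not a hard-coded constant.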

Using Thrust, I plan to build GPU acceleration into the algorithms provided by the C++ STL, including methods associated with the containers vector, list, and map. As an example, I've implemented GPU acceleration for sort on std::list. The Thrust library only implements algorithms over vectors, but it is possible to first flatten a list to a vector, sort on the GPU, and then un-flatten the vector back into a list.
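As a minimal sketch, the flatten/sort/un-flatten sequence could look like the following (gpu_list_sort is a hypothetical name for illustration):


#include <algorithm>
#include <list>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// Hypothetical sketch: flatten a list, sort on the GPU, un-flatten
template <typename T>
void gpu_list_sort(std::list<T> & l){
    // Flatten the list into contiguous host storage
    std::vector<T> host(l.begin(), l.end());

    // Transfer to the device and sort on the GPU
    thrust::device_vector<T> device(host.begin(), host.end());
    thrust::sort(device.begin(), device.end());

    // Copy the sorted data back and un-flatten into the list
    thrust::copy(device.begin(), device.end(), host.begin());
    std::copy(host.begin(), host.end(), l.begin());
}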



With no CPU optimizations on a simple microbenchmark that sorts a large random list, the GPU sort begins to dominate at a list size of 128,000 integers, as shown in the graph below.



With optimizations, the difference becomes even greater, as the copy to and from the list no longer dominates the runtime of the GPU-accelerated algorithm. For a list of 16,000,000 integers, the CPU program takes approximately 19 minutes to execute, while the GPU version finishes in 12 seconds: a speedup of approximately 95x. A C++ programmer can achieve this performance simply by adding a flag when compiling!

For this project, I plan to accelerate the methods in the following headers (some methods may be added to the interfaces):

algorithm: for_each, find, count, equal, search, sort, min_element, max_element, set_union, set_intersection, set_difference, set_symmetric_difference
vector: sort, find
list: sort

As part of the project, I will analyze the performance of these methods on multiple GPUs and investigate runtime heuristics for choosing the correct execution environment.

Lastly, here are a few links that relate to this project:

C++ STL Reference

Thrust Reference

Project Proposal (PDF)

Github Repository