Welcome to my project blog for
CIS 565:
GPU Programming and Architecture at the University of Pennsylvania. For this project, my goal is to add automatic
GPU acceleration to the C++
STL.
The C++ Standard Template Library provides containers and algorithms for use in C++ programs. As a product of the sequential era of computing, these algorithms do not take advantage of advances in parallel computation, such as an available
GPU. At large enough input sizes, the
GPU will outperform the CPU for many of the algorithms, such as sorting and searching, that are implemented in the C++
STL.
But wait, C++ programmers can just use
Thrust when they want to execute on the
GPU, so why should we bother with this project? Well, by baking
GPU execution into the
STL, C++ programmers won't be forced to learn a new library to accelerate their programs, and programs that have already been written to use the
STL can simply be recompiled with a new compiler flag to enable
GPU execution. Plus, using a library like Thrust generally requires that the programmer analyze the performance of the algorithms to determine when to execute on the
GPU and when to execute on the CPU. Instead, we can determine general heuristics that will allow us to automatically choose the correct execution environment. Thus, we can accelerate performance in the common case and leave direct use of a library like Thrust to uncommon situations.
Using Thrust, I plan to build
GPU acceleration into the algorithms provided by the C++
STL, including methods associated with the containers vector, list, and map. As an example, I've implemented
GPU acceleration for sort on std::list. The Thrust library only implements algorithms over vectors, but it is possible to first flatten a list to a vector, sort on the
GPU, and then
un-flatten the vector back into a list.
With no CPU optimizations on a simple
microbenchmark that sorts a large random list, the
GPU sort begins to dominate at a list size of 128,000 integers, as shown in the graph below.
With optimizations, the difference becomes even greater, as the copy to and from the list no longer dominates the
runtime of the
GPU accelerated algorithm. For an array of 16,000,000 integers, the CPU program takes approximately 19 minutes to execute while the
GPU version finishes in 12 seconds. This is a speedup of approximately 26x, and a C++ programmer can achieve this performance by simply adding a flag when compiling!
For this project, I plan to accelerate the methods in the following headers (some methods may be added to the interfaces):
algorithm: foreach, find, count, equal, search, sort, min_element, max_element, set_union, set_intersection, set_difference, set_symmetric_difference
vector: sort, find
list: sort
As part of the project, I will analyze the performance of these methods on multiple GPUs and investigate runtime heuristics for choosing the correct execution environment.
Lastly, here are a few links that relate to this project:
C++ STL Reference
Thrust Reference
Project Proposal (PDF)
Github Repository