Monday, March 12, 2012

Project Overview

Welcome to my project blog for CIS 565: GPU Programming and Architecture at the University of Pennsylvania. For this project, my goal is to add automatic GPU acceleration to the C++ STL.

The C++ Standard Template Library provides containers and algorithms for use in C++ programs. As a product of the sequential era of computing, these algorithms do not take advantage of advances in parallel computation, such as an available GPU. At large enough input sizes, the GPU will outperform the CPU for many of the algorithms, such as sorting and searching, that are implemented in the C++ STL.

But wait, C++ programmers can just use Thrust when they want to execute on the GPU, so why should we bother with this project? Well, by baking GPU execution into the STL, C++ programmers won't be forced to learn a new library to accelerate their programs, and programs that have already been written to use the STL can simply be recompiled with a new compiler flag to enable GPU execution. Plus, using a library like Thrust generally requires that the programmer analyze the performance of the algorithms to determine when to execute on the GPU and when to execute on the CPU. Instead, we can determine general heuristics that will allow us to automatically choose the correct execution environment. Thus, we can accelerate performance in the common case and leave direct use of a library like Thrust to uncommon situations.

Using Thrust, I plan to build GPU acceleration into the algorithms provided by the C++ STL, including methods associated with the containers vector, list, and map. As an example, I've implemented GPU acceleration for sort on std::list. The Thrust library only implements algorithms over vectors, but it is possible to first flatten a list to a vector, sort on the GPU, and then un-flatten the vector back into a list.



With no CPU optimizations on a simple microbenchmark that sorts a large random list, the GPU sort begins to dominate at a list size of 128,000 integers, as shown in the graph below.



With optimizations, the difference becomes even greater, as the copy to and from the list no longer dominates the runtime of the GPU accelerated algorithm. For an array of 16,000,000 integers, the CPU program takes approximately 19 minutes to execute while the GPU version finishes in 12 seconds. This is a speedup of approximately 26x, and a C++ programmer can achieve this performance by simply adding a flag when compiling!

For this project, I plan to accelerate the methods in the following headers (some methods may be added to the interfaces):

algorithm: foreach, find, count, equal, search, sort, min_element, max_element, set_union, set_intersection, set_difference, set_symmetric_difference
vector: sort, find
list: sort

As part of the project, I will analyze the performance of these methods on multiple GPUs and investigate runtime heuristics for choosing the correct execution environment.

Lastly, here are a few links that relate to this project:

C++ STL Reference

Thrust Reference

Project Proposal (PDF)

Github Repository

No comments:

Post a Comment