One of the major benefits of the STL is the generic application of algorithms over iterators: each algorithm needs only a single implementation (possibly with template specializations) to work across many container types. As an example, linear search on a vector and a list can be implemented with the same simple code:
template <typename Iterator, typename T>
Iterator find(Iterator first, Iterator last, const T& value) {
    // Walk the range, returning the first position whose element
    // compares equal to value; return last if no element matches.
    for (; first != last; ++first) {
        if (*first == value) {
            return first;
        }
    }
    return last;
}
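For instance, with the find template above in scope, the same code instantiates for both a std::vector and a std::list without any change to the algorithm:

#include <list>
#include <vector>

int main() {
    std::vector<int> v = {3, 1, 4, 1, 5};
    std::list<int> l(v.begin(), v.end());

    // Both calls use the identical implementation above; only the
    // iterator type differs between the two instantiations.
    ::find(v.begin(), v.end(), 4);  // random-access iterator
    ::find(l.begin(), l.end(), 4);  // bidirectional iterator
}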
When implementing algorithms in the STL style, the programmer must be aware of a few constraints on how these functions will be called:
- A method may be called once and then never called again
- The iterator input may come from many different container types
These constraints make it difficult to optimize the code for specific cases. For example, GPU acceleration would benefit greatly if an STL method could tell whether a container had been modified between calls: on the second call, it could skip copying the data from the host to the device. However, even when the "first" and "last" iterators point to the same memory locations on consecutive calls, there is no efficient way to determine whether the data in the container has changed.
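To make the problem concrete, here is a hypothetical device-buffer cache (DeviceCache and checksum are illustrative names, not part of any real library; the checksum assumes integral elements). Validating the cached copy requires reading every element, so the check costs as much as the copy it is trying to avoid:

#include <cstddef>
#include <cstdint>
#include <iterator>

// Hypothetical record of the last host range copied to the device.
struct DeviceCache {
    const void* host_first = nullptr;
    std::size_t count = 0;
    std::uint64_t sum = 0;     // checksum of the cached contents
    void* device_ptr = nullptr;
};

template <typename It>
std::uint64_t checksum(It first, It last) {
    // O(n): touches every element, which is exactly the cost we
    // were hoping to avoid by caching the device copy.
    std::uint64_t h = 1469598103934665603ull;  // FNV-1a offset basis
    for (; first != last; ++first) {
        h = (h ^ static_cast<std::uint64_t>(*first)) * 1099511628211ull;
    }
    return h;
}

template <typename It>
bool cache_is_valid(const DeviceCache& c, It first, It last) {
    // Matching addresses and lengths do not imply unchanged contents,
    // so the only safe validation reads all n elements anyway.
    return static_cast<const void*>(&*first) == c.host_first
        && static_cast<std::size_t>(std::distance(first, last)) == c.count
        && checksum(first, last) == c.sum;
}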
In the case of std::find, the constraints placed on STL methods make it difficult to accelerate execution using a GPU. Using linear search, the implementation of std::find has a runtime of O(n); with an iterator, copying the data to the GPU alone is already O(n). In some cases, such as with a vector or a static-sized array, we can identify that the underlying memory is contiguous and perform a direct bulk copy to the GPU without going through iterator operations. Even so, this is not fast enough on my development system: with "-O3" CPU optimizations, the GPU version of std::find is never more efficient than the CPU version for reasonable data sizes. Thus, I will have to save further performance tuning of std::find and other O(n) STL methods for a system with a newer GPU.
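As a sketch of the contiguous-memory special case, the dispatch might look like the following (the trait is hand-rolled, and the CUDA steps are left as comments; find_on_device would be a hypothetical kernel launcher, not an existing API):

#include <algorithm>
#include <type_traits>

// Trait: true only for ranges we can prove are one contiguous block.
// Raw pointers qualify; everything else conservatively does not.
template <typename It>
struct is_contiguous : std::false_type {};

template <typename T>
struct is_contiguous<T*> : std::true_type {};

template <typename It, typename T>
It gpu_find(It first, It last, const T& value) {
    if (is_contiguous<It>::value) {
        // One bulk copy instead of n iterator dereferences:
        //   cudaMemcpy(dev, &*first, (last - first) * sizeof(T),
        //              cudaMemcpyHostToDevice);
        // ...then launch a search kernel (find_on_device)...
    }
    // Fall back to the host implementation; on my hardware this is
    // what wins anyway once -O3 is enabled.
    return std::find(first, last, value);
}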
However, GPU acceleration of methods like this may not be entirely impossible. As suggested in my project proposal, it may be possible to mimic Intel's Array Building Blocks and "queue" operations over containers until the result is actually used. We could then combine multiple copies from the CPU to the GPU into a single copy and execute several algorithms in sequence over the same data on the device. Using this approach, O(n) methods like std::find might run on the GPU with a net performance gain. For now, I will leave this possibility to future work and focus on some of the more algorithmically complex methods.
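A rough sketch of what such deferred execution might look like (entirely hypothetical; LazyRange and its members are illustrative names, not ArBB's API, and the device work is simulated on the host):

#include <functional>
#include <utility>
#include <vector>

// Hypothetical lazily-evaluated range: operations are recorded,
// not executed, until a result is actually demanded.
template <typename T>
class LazyRange {
public:
    explicit LazyRange(std::vector<T> data) : host_(std::move(data)) {}

    // Record an element-wise transform; no copy, no kernel launch yet.
    LazyRange& map(std::function<T(T)> f) {
        pending_.push_back(std::move(f));
        return *this;
    }

    // Forcing a result flushes the queue: one host-to-device copy,
    // then every queued operation runs over data already resident on
    // the device (simulated here with host loops).
    std::vector<T> result() {
        // ...a single cudaMemcpy would go here...
        for (const auto& f : pending_) {
            for (auto& x : host_) x = f(x);
        }
        pending_.clear();
        return host_;
    }

private:
    std::vector<T> host_;
    std::vector<std::function<T(T)>> pending_;
};

With this structure, a chain like r.map(f).map(g).result() would pay for only one host-to-device transfer no matter how many operations were queued.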