Instead of accelerating std::for_each, which does not allow a return value, I changed plans and accelerated std::transform. This function provides a great experimental platform for testing the heuristics needed to accelerate STL computations on a GPU, because std::transform accepts a nearly arbitrary function object.
Taking advantage of this capability, I created a functor that executes M multiplications and used it to measure how the CPU and GPU versions of transform scale with the number of multiplications in the functor's operator(). The graph below shows the speedup of GPU execution for 10 to 100 multiplications (using a vector of 8 million floats).
From this experiment, it became clear that the runtime of the GPU version is tied to the size of the vector rather than to the number of instructions in the functor. The runtime of the CPU version, on the other hand, corresponds directly to the number of instructions. Going from 10 to 100 instructions produces only a 1.08x difference in runtime on the GPU, but a 72x difference on the CPU.
At this point, I plan to work on heuristics for selecting CPU or GPU execution. I'll also gather performance numbers for each experiment on different GPUs.
I'm looking forward to the heuristics.
Also, when you do performance analysis, time both the CPU-GPU data transfer and the kernel execution. If data transfer is commonly the bottleneck, that is an indicator that we could use the GPU even more on systems-on-a-chip, where we do not have to worry about the PCIe bus.