OLCF Seminar

A comparison of several accelerator programming models.
by Jeff Larkin, 30 April 2010

Transcript of OLCF Seminar

A Comparison of Accelerator Programming Models
Jeff Larkin
Cray, Inc.
larkin@cray.com

(Slides: Background; Original Code)

Increased Parallelism
Pull in the q and k loops to increase parallelism
Split the function in two: a launcher and a kernel (see the sketch below this list)
The launcher will serve as a drop-in replacement and handle data movement and kernel launching
The kernel performs the actual calculations
Iteratively improve on-device memory usage and thread blocking
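To make the plan concrete, here is a minimal sketch of the launcher/kernel split in C for CUDA. The names and the element-wise computation are hypothetical stand-ins, not the code from the talk; the point is that the launcher keeps the original routine's signature so it can drop in for the CPU version.

    #include <cuda_runtime.h>

    // Hypothetical kernel: one thread per element, with the element index
    // computed from the block and thread coordinates.
    __global__ void compute_kernel(int n, const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];    // stand-in for the real calculation
    }

    // Drop-in launcher: same interface as the original CPU routine, but it
    // handles device allocation, data movement, and the kernel launch.
    void compute(int n, const float *in, float *out)
    {
        float *d_in, *d_out;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_in,  bytes);
        cudaMalloc((void **)&d_out, bytes);
        cudaMemcpy(d_in, in, bytes, cudaMemcpyHostToDevice);

        int threads = 128;                         // thread-blocking knob to tune
        int blocks  = (n + threads - 1) / threads;
        compute_kernel<<<blocks, threads>>>(n, d_in, d_out);

        cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_in);
        cudaFree(d_out);
    }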


We'll use this plan again when we get to OpenCL.

Directive-izing
(Slides: PGI Directives; HMPP Directives)
Start with the CPU kernel and add directives around the outermost loop (see the sketch below this list)
Introduce temporary arrays for data in structures, as necessary
Set data movement parameters
Tune blocking via directives
May still need to rewrite nested loops for best performance
(Code slides: Untuned PGI Kernel; Broken Tuned Kernel; Rewritten Kernel)
The compiler doesn't give correct answers when "tuned" (reported to PGI); the rewritten kernel is thanks to Oscar.
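As an illustration of this recipe, here is a minimal sketch using the pre-OpenACC PGI Accelerator directives the talk describes. The loop is a hypothetical stand-in, and the exact clause spellings varied by compiler release.

    /* Directives wrap the outermost loop; the data clauses set movement
       parameters so each array is copied only in the direction needed. */
    void scale(int n, const float *restrict in, float *restrict out)
    {
    #pragma acc region copyin(in[0:n]) copyout(out[0:n])
        {
            for (int i = 0; i < n; ++i)   /* compiler builds the GPU kernel */
                out[i] = 2.0f * in[i];
        }
    }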

Thoughts on PGI Directives
Fairly simple syntax to learn (OpenMP-like)
Some code modification may still be required
Correctness took some massaging with each kernel
Performance tuning takes additional effort and is not necessarily transferable between kernels
Data regions can't span multiple functions or subroutines
I felt that the directives took longer, resulting in lower productivity, and gave worse results than CUDA
Lacks the maturity necessary to recommend for production; this will likely improve as the product matures

CUDA
Available for C/C++ from Nvidia or Fortran from PGI
Only available for Nvidia GPUs
Requires writing a thread "kernel" and managing data movement and blocking
Freely available and well-documented
Lots of courses and tutorials available on the web
Taught at many universities
Fairly mature, but still changing

PGI Directives
Available in PGI compilers for C/C++ and Fortran
Currently only available for Nvidia GPUs, but other accelerators will likely be added
Directives/Pragmas are added to existing code to identify accelerator "regions," much like OpenMP regions
Currently immature, but rapidly improving

HMPP Directives
Preprocessed from C/C++ and Fortran sources
Supports several compilers and devices, including multi-core CPUs
Directives/Pragmas are added to existing code to identify accelerator "codelets"
Designed to support higher-level parallelism than PGI directives
Requires the programmer to be more explicit than PGI directives
Currently immature, but rapidly improving (see the codelet sketch below)
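For comparison with the PGI sketch above, here is a minimal HMPP-style codelet. The function is a hypothetical stand-in, assuming HMPP 2.x-era directive spellings, which also varied by release.

    /* The codelet is a whole function offloaded to the target device;
       the io clause tells the preprocessor which way data must move. */
    #pragma hmpp scale codelet, target=CUDA, args[out].io=out
    void scale(int n, const float in[n], float out[n])
    {
        for (int i = 0; i < n; ++i)
            out[i] = 2.0f * in[i];
    }

    void caller(int n, float *a, float *b)
    {
    #pragma hmpp scale callsite    /* this call runs on the accelerator */
        scale(n, a, b);
    }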

OpenCL
Generic API designed and maintained by a committee
Currently only available for C/C++
Designed for maximum portability to a variety of architectures
Closely resembles the CUDA "Driver API"
Requires a thread kernel that is nearly identical to CUDA
Requires explicit management of data movement and blocking
Still young; documentation and tutorials improving

CPU vs. GPU
CPU: general purpose, well understood                     | GPU: special purpose for data-intensive, parallel code
CPU: directly connected to memory                         | GPU: all data must be copied via the slow PCIe bus
CPU: large, managed cache                                 | GPU: user-managed caching
CPU: cores work independently                             | GPU: all cores work in lockstep within groups
CPU: lots of silicon for legacy support and a large cache | GPU: lots of silicon for simple compute cores

GPU Memory
Registers: all threads share the same register file, so thread count is important.
Shared memory: small but very fast "L1", accessible by all threads in a block; user managed.
Device memory: the "main memory" for all GPU kernels; the slowest but largest memory on the device. CPU code must copy all data here.
Programmers must think about how best to reduce transfers to/from the device and how best to use the memory on the device.

Grids, Blocks, and Threads
A THREAD is an independent element of work and maps to a hardware core.

A BLOCK is a 1D, 2D, or 3D group of related threads that share "fast" memory. Synchronization is possible within blocks.

A GRID is a 1D or 2D grouping of blocks running the same kernel. No synchronization is possible between blocks in a grid. When one block stalls, the hardware will switch to another, so there is no guaranteed order of operations. (See the sketch below.)
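A minimal sketch of how these pieces appear in a kernel (CUDA C shown; the names and the 3-point stencil are hypothetical): each thread computes its global index from its grid/block coordinates, the block stages data in user-managed shared memory, and synchronization happens only within the block.

    // Assumes a 1D launch with blockDim.x == 128 so the tile fits the block.
    __global__ void smooth_kernel(int n, const float *in, float *out)
    {
        __shared__ float tile[128 + 2];   // block-local "fast" memory + halo

        int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        int lid = threadIdx.x + 1;                        // index within the tile

        if (gid < n) tile[lid] = in[gid];
        if (threadIdx.x == 0)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;               // left halo
        if (threadIdx.x == blockDim.x - 1)
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;     // right halo

        __syncthreads();   // synchronize within the block only; there is no
                           // synchronization between blocks in the grid

        if (gid < n)
            out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }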

Programming a GPU, Regardless of Programming Model
Expose a lot of parallelism: what is done thousands of times, each time independent of the others? Can I increase parallelism by looking up the call stack?

Think about how to reduce data transfer and increase data reuse. Is it cheaper to compute than to copy? What can I leave on the device for later?

Avoid "branchy" code. Can the branches be written out?
(Code slides: Directives Plan; Tuned Kernel; Workaround)

Progress So Far
GPU (untuned): 3.5288572311401370
GPU (shared): 2.6597976684570310
GPU (constant): 2.6611089706420900
GPU (C): 2.5274753570556640
GPU (C constant): 2.4623394012451170
ACC (untuned): 106.4257860183716000
ACC (broken): 2.8429031372070310
ACC (tuned): 3.0396223068237300
ACC (oscar): 2.9479265213012700
(Code slides: HMPP Kernel; Tuned HMPP Kernel)

Progress So Far
GPU (untuned): 3.5288572311401370
GPU (shared): 2.6597976684570310
GPU (constant): 2.6611089706420900
GPU (C): 2.5274753570556640
GPU (C constant): 2.4623394012451170
ACC (untuned): 106.4257860183716000
ACC (broken): 2.8429031372070310
ACC (tuned): 3.0396223068237300
ACC (oscar): 2.9479265213012700
HMPP (untuned): 100.1704692840576000
HMPP (tuned): 8.9429855346679690

Thoughts on HMPP
Syntactically more difficult than PGI, but operates at a higher level
Documentation is a bit lacking
Fortran support is still fairly immature; they're working on this
Lack of 3D blocks is disappointing
Limitations on constant and shared memory support
Poor performance; is this a technology issue or an education issue? I hear that C is better.
In my opinion, this is too immature for production use

Closing Thoughts
In my experience, CUDA provided the best performance and productivity of today's programming models
There is no performance benefit to choosing CUDA C or CUDA Fortran over the other
OpenCL performance is on par with CUDA, but it is more challenging to work with: portability over productivity
Today's directives are still pretty immature, but improving
All of these technologies are changing rapidly; the picture could look very different in a year.

(Code slides: Launcher Function; Untuned Kernel; Shared Memory Kernel; Shared Launcher Function; Constant Memory Kernel; Constant Launcher Function; C for CUDA Kernel; C Launcher Function)

CUDA Cons
Limited Portability
Learning Curve
Must maintain multiple code paths
CUDA Fortran has some module requirements that may cause problems for some codes
Currently unable to cross module boundaries

CUDA Pros
Now, with CUDA C and CUDA Fortran, it's possible to stay in the same language
Should be the best possible performance
Most control over the memory hierarchy, data movement, and synchronization

CUDA-izing
(Slides: The Plan)

Progress So Far
GPU (untuned): 3.5288572311401370
GPU (shared): 2.6597976684570310
GPU (constant): 2.6611089706420900
GPU (C): 2.5274753570556640
GPU (C constant): 2.4623394012451170
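The "constant" variants in the results above place small read-only parameters in constant memory, which is cached and broadcast to all threads. A minimal sketch of the idea (hypothetical names and table size; not the talk's kernel):

    #include <cuda_runtime.h>

    __constant__ float c_coef[16];   // read-only table in cached constant memory

    __global__ void compute_const_kernel(int n, const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = c_coef[0] * in[i] + c_coef[1];   // stand-in calculation
    }

    // Host side: load the table once, then reuse it across many launches,
    // avoiding a per-launch argument copy.
    void load_coefficients(const float *host_coef)
    {
        cudaMemcpyToSymbol(c_coef, host_coef, 16 * sizeof(float));
    }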
OpenCL-izing
(Code slides: OpenCL Launcher; OpenCL Kernel)

Thoughts on OpenCL
The resulting kernel is very similar to CUDA C, with minor syntactical differences
Launching a kernel is dramatically more complex than in CUDA
JIT Compilation makes development & debugging difficult
Kernel performance is on par with CUDA, though with slightly more overhead?
Reasonable for codes where portability outweighs programmer productivity
I'd expect OpenCL to make more sense for commercial software than research software.

Progress So Far
GPU (untuned): 3.5288572311401370
GPU (shared): 2.6597976684570310
GPU (constant): 2.6611089706420900
GPU (C): 2.5274753570556640
GPU (C constant): 2.4623394012451170
ACC (untuned): 106.4257860183716000
ACC (broken): 2.8429031372070310
ACC (tuned): 3.0396223068237300
ACC (oscar): 2.9479265213012700
HMPP (untuned): 100.1704692840576000
HMPP (tuned): 8.9429855346679690
OCL (untuned): 3.4917593002319340

Detailed Comparison
Once you're on the device, there are only minimal differences between CUDA and OpenCL (see the sketch below).
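To illustrate how small the device-side differences are, here is a minimal hypothetical kernel written both ways (each would live in its own source file); mostly the qualifiers and the index computation are renamed. Host-side launching is another story: OpenCL needs a platform, device, context, command queue, JIT-compiled program, kernel object, and a clSetKernelArg call per argument, where CUDA needs a single <<<blocks, threads>>> launch.

    // CUDA C version (compiled ahead of time by nvcc)
    __global__ void scale_kernel(int n, const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    // OpenCL C version (JIT-compiled at run time): same body, with
    // different qualifiers and a built-in call for the global index.
    __kernel void scale_kernel(int n, __global const float *in,
                               __global float *out)
    {
        int i = get_global_id(0);
        if (i < n) out[i] = 2.0f * in[i];
    }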