GPGPU - supercomputing in your pocket

Introduction to scientific graphics card programming

André Bergner

on 13 July 2010


Transcript of GPGPU - supercomputing in your pocket

CUDA and OpenCL: language extensions to C / C++ / Fortran.

The emergence of GPGPU

The development of GPUs is driven by the DirectX specifications of Microsoft. DirectX evolution: Wolfenstein 3D (May 1992), DOOM (Dec. 1993), Crysis (Nov. 2007).

Why parallelization?

CPU speed-up has come to an end due to the laws of physics (since 2003 the clock speed has been increasing only slowly).

→ but speed improvement is possible through parallelization
(the number of cores is increasing rapidly [32 CPU cores by 2012])
→ software has to adapt: from sequential to parallel processing

GPGPU: general-purpose computing on graphics processing units. Fixed-function shaders have been replaced by programmable (vertex, fragment) shaders, and later by unified shaders.
The original GPGPU technique: trick the 3D engine into computing what you want.
  • use shaders for some general task
  • data is stored in textures (= arrays)
  • the API is designed for graphics → one has to trick around to do what is needed
  • low memory transfer rates

Conceptual background: Turing machine, von Neumann architecture, parallel processing, cellular automata.

Vendors & hardware

NVIDIA Tesla: built for supercomputing
  • 7x increase in double-precision performance
  • 448 cores → 515 Gflops (double), 1.03 Tflops (single)
  • memory bandwidth: 144 GB/sec
  • price from ~1000 €
NVIDIA GeForce:
  • 8400 GS: 8/16 cores, 43 Gflops, 512 MB RAM, memory bandwidth 6.4 GB/s
  • 9800M GS: 64 cores, 254 Gflops, 512 MB RAM, memory bandwidth 51.2 GB/s

PowerVR SGX: targets the embedded and mobile computing market
  • ~0.9 Gflops
  • the new PowerVR SGX545 is OpenCL-enabled

Supercomputing in your pocket, then and now:
  • Cray 1: 80 MFlops, 8.8 M$, volume ~300 l, weight 5.5 t
  • N900: 1 GFlop, 500 $, volume ~0.1 l, weight 180 g
Links & resources
  • gpgpu.org [community site]
  • gpgpu.org/developer#reading-material [← start reading here]
  • http://www.nvidia.de/page/tesla_computing_solutions.html [NVidia's supercomputing solution]
  • developer.nvidia.com/object/gpucomputing.html [resources on CUDA]
  • gpucomputing.net [another community site]
NVidia CUDA Compute Architecture

Get your hands dirty...

The nvcc compiler toolchain, with all code in one file: CUDA C source → nvcc / gnu compiler toolchain → linker → executable.

OpenGL, the geek way:
  • but... what is it? the original GPGPU technique described above: trick the 3D engine into computing what you want
  • what is needed? the OpenGL lib and a C/C++ compiler
  • complicated and inflexible

CUDA, what is needed? a CUDA-enabled GPU and driver (an emulation mode is supplied), plus a C/C++ compiler.

OpenCL, what is needed? a CUDA-enabled GPU and driver (an emulation mode is supplied), plus a C/C++ compiler.
Example: a supercomputing cluster of 11 PlayStation 3 consoles (200 GFlop · 11 = 2.2 TFlop).

CUDA's compilation model: the GPU program is compiled to intermediate PTX code, which the CUDA driver compiles for the target device at runtime; the CPU program runs natively.

Comparison of parallelization APIs:
  • OpenCL targets multiple devices (GPU, multi-core CPU) within a single system
  • OpenMP targets multi-core CPUs and SMP (symmetric multiprocessing); see the sketch after this list
  • MPI is a message-passing protocol, mainly targeting computer clusters
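To make the contrast concrete, here is a minimal sketch (not from the talk; file and variable names are illustrative) of the OpenMP approach: a single pragma parallelizes a loop across the cores of one shared-memory CPU. Compile with gcc -fopenmp.

    /* vec_add_omp.c: shared-memory parallelism with OpenMP (illustrative) */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        enum { N = 1000000 };
        static float a[N], b[N], c[N];   /* zero-initialized static arrays */

        /* one pragma distributes the loop iterations over all cores */
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            c[i] = a[i] + b[i];

        printf("ran with up to %d threads\n", omp_get_max_threads());
        return 0;
    }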
The idea of parallelization is old. Concepts of parallelization:
  • symmetric multiprocessing (SMP): many processors on one mainboard; a shared-memory approach
  • computer cluster: many computers working together in a network; slow communication due to separate memory
  • multicore CPU: similar to SMP; many CPU cores share one chip (thus shared cache / shared memory)
  • vectorization: single instruction, multiple data (SIMD); one instruction triggers several identical ALUs, each accessing its own data
  • GPU: a massively parallel multicore processor; reduced control logic and cache → most of the chip surface can be used for ALUs (a CUDA sketch follows below)
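The transcript preserves no listing here, so the following is an illustrative sketch (the kernel name is a placeholder, not from the slides) of how the SIMD idea looks in CUDA C: every thread runs the same instruction stream, each on its own array element.

    // vec_add.cu: one instruction stream, many data elements (SIMD style)
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)            // guard: the launch grid may be larger than n
            c[i] = a[i] + b[i];
    }

A launch such as vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n) would start one thread per array element.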
Example: how does it work?

Bindings: besides CUDA C there are Python (pycuda; the talk includes a complete pycuda example) and Java (jcuda).

The example itself is compiled with "nvcc hello_cuda.cu":
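The hello_cuda.cu listing itself is not preserved in the transcript; the following is only a minimal sketch of what a file compiled that way could look like (kernel and variable names are illustrative):

    // hello_cuda.cu: each GPU thread writes its own index into an array,
    // which the CPU then prints.
    #include <cstdio>

    __global__ void fill_ids(int *out)
    {
        out[threadIdx.x] = threadIdx.x;   // every thread stores its own id
    }

    int main()
    {
        const int N = 8;
        int host[N];
        int *dev = 0;

        cudaMalloc((void**)&dev, N * sizeof(int));   // allocate GPU memory
        fill_ids<<<1, N>>>(dev);                     // launch N parallel threads
        cudaMemcpy(host, dev, N * sizeof(int),
                   cudaMemcpyDeviceToHost);          // copy the result back
        cudaFree(dev);

        for (int i = 0; i < N; ++i)
            printf("hello from thread %d\n", host[i]);
        return 0;
    }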
Computer-generated images → need for speed!

Hollywood: the computer animation of the Death Star in Star Wars (1977), Terminator 2 (1991), Avatar (2009).

Computer games: games became more and more realistic, and physical behaviour became important (e.g. fluids, hanging clothing, dropping bodies, ...); for this reason, physics engines have been moved from the CPU to the GPU.

The great demand for 3D computing power led to the development of highly specialized hardware: the GPU.

How it's done: 3D graphics rendering / ray tracing is an inherently parallel problem, and physics is inherently parallel too → simulation can benefit a lot from parallelization.

Applications & algorithms: some selected applications:
  • cortical network simulator (speed-up: 80x)
  • PDE solver (speed-up: 38x)
  • flow in porous media (speed-up: 100x)
  • realtime fluid and particle simulation (speed-up: 40x)

Further fields: fluid dynamics, medical imaging, neural networks, protein folding, signal processing, linear algebra, Monte Carlo, pattern recognition, gravitational n-body simulation, ...

Two more showcases: an OpenGL shader simulating the complex Ginzburg-Landau equation, and the Folding@home project by Stanford University (peak performance so far: ~1 PetaFlop).
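For reference, a standard form of the complex Ginzburg-Landau equation (the talk's exact parametrization is not preserved in the transcript), with A(x, t) a complex field and b, c real parameters:

    \partial_t A = A + (1 + i b)\,\nabla^2 A - (1 + i c)\,|A|^2 A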
Standards: the vendors and their APIs:
  • CUDA (NVIDIA): supported by modern NVIDIA GPUs
  • Stream (ATI): similar to CUDA, but with a much smaller community
  • DirectCompute (Microsoft): limited to Windows; mainly targets the games market (physics engines)
  • OpenCL (Khronos Group): an open language, available for Windows and other platforms

How does OpenCL work? A specialized C dialect is compiled at runtime, and the decision over the target device (CPU/GPU) is also made at runtime. Compile the example below with "g++ -lOpenCL hello_opencl.c".
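The hello_opencl.c listing is likewise not preserved; a minimal sketch of such a host program (illustrative names, error handling omitted) could look like this. Note how the kernel is a string of OpenCL's C dialect that only gets compiled at runtime:

    /* hello_opencl.c: each work-item writes its own id; the CPU prints it */
    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void fill_ids(__global int *out) {"
        "    out[get_global_id(0)] = get_global_id(0);"
        "}";

    int main(void)
    {
        enum { N = 8 };
        int result[N];

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        /* compile the kernel at runtime for whatever device was chosen */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "fill_ids", NULL);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                    N * sizeof(int), NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, N * sizeof(int),
                            result, 0, NULL, NULL);

        for (int i = 0; i < N; ++i)
            printf("hello from work-item %d\n", result[i]);
        return 0;
    }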
STL port: the thrust lib (the CUDA code is encapsulated away behind STL-style containers and algorithms).
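A small sketch of that style (illustrative, not from the talk): the device_vector lives in GPU memory, and the parallel CUDA kernels behind sequence and reduce stay hidden. Compile with nvcc:

    // thrust_sum.cu: STL-like GPU programming with thrust
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/reduce.h>
    #include <cstdio>

    int main()
    {
        thrust::device_vector<int> v(1000);    // lives in GPU memory
        thrust::sequence(v.begin(), v.end());  // fill with 0, 1, 2, ...
        int sum = thrust::reduce(v.begin(), v.end()); // parallel sum on the GPU
        printf("sum = %d\n", sum);             // 0 + 1 + ... + 999 = 499500
        return 0;
    }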
This talk: http://prezi.com/t5lm7cmxpv90/