Loading presentation...

Present Remotely

Send the link below via email or IM


Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.


Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks

Gaspard2 and OpenCL

Presentation on Paris VI at 29 avril

Wendell Rodrigues

on 11 May 2010

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of Gaspard2 and OpenCL

Gaspard2 and OpenCL Wendell Rodrigues, PhD Student
DaRT Team - USTL -LIFL - INRIA Lille
M. Jean-Luc Dekeyser and M.Frederic Guyomarc'h Region Nord-Pas de Calais - Valeo - GPUTech Part1: What's GPU? Vector Processor: Cray read instruction and decode it
fetch these 10 numbers
fetch those 10 numbers
fetch another 10 numbers
add and multiply them
put the results here 1980/1990 Connection Machines: CM-5 It is a "massively parallel" hypercubic arrangement of thousands of microprocessors, each with its own 4 kbits of RAM, which together executed in a SIMD fashion GPU: Graphics Processing Unit Current GPUs are both a vector machine and massively parallel
More than SIMD, they are SPMD
But they're always a co-processor
Programming Languages: Shader Languages (Sh, Cg, etc.); Brook+(CAL); CUDA, OpenCL (NVidia and AMD) OpenCL (Computing Language) An opened Standard managed by the Khronos group
– Influenced & guided by Apple
– Spec 1.0 approved Dec’08
A system for executing short "Enhanced C" routines (kernels) across devices
– All around Heterogeneous Platforms – Host & Devices
– Devices: CPU, GPU, Accelerator (FPGA)
Skewed towards GPU HW
– Samples, Vector Types, etc.
Offers hybrid execution capability Apple, NVidia, AMD, Intel, IBM, RapidMind, Electronic Arts
(EA), 3DLABS, Activision Blizzard, ARM, Barco, Broadcom,
Codeplay, Ericsson, Freescale, HI, Imagination Technologies,
Kestrel Institute, Motorola, Movidia, Nokia, QNX, Samsung,
Seaweed, Takumi, TI and Umeå University http://gpumodeling.blogspot.com NVidia's CUDA was a disruptive technology
writes C on the GPU
extends to non-traditional usages
provides synchronization mechanism

OpenCL deepens and extends the revolution Refactored GPGPU void sumvec(float *A, float *B,
float *C)
for (i=0;i<N;i++)
C[i] = A[i] + B[i];
} __kernel void sumvec(__global const float *A,
__global const float *B,
__global float *C)
int gid = get_global_id(0);
C[gid] = A[gid] + B[gid];
} Simple CPU Sequential C Version Simple CPU Parallel OCL Version OpenCL: Work-Groups and Work-items Linear Algebra and OpenCL There are no BLAS, FFT, etc. libraries for OpenCL (not yet!)
Make yourself the functions you need
Our problem: Conjugate Gradient with Sparse Matrices (order of 500k or more elements)

Conjugate Gradient Outline GPU/OpenCL/Linear Algebra Gaspard2 Environment CPU controls the main loop and scalar operations
Many kernel launches in the main loop ~2 Gflops Implementation Details SpMV
Scalar and Loop Control Part 2 - Purpose
Computation Model: ArrayOL
Gaspard2 and OpenCL Example Modeling Example (Array Product) Application Model
Architecture Model
Association Model
Deployment Model
QVT Transformations and Metamodels
(add, delete and transform elements)
Acceleo Model to Text Matrix Vector Multiplication Sparse, Square, Symmetric, CSR format
~449.000 elements, N=30880
CSR format CL Code: __kernel void SpMV(
__global const unsigned int* iT,
__global const unsigned int* jT,
__global const float* T,
__global const float* U,
__global float* V,
__global int iNumElements)
// get index into global data array
int iGID = get_global_id(0);
float sum=0;

// bound check (equivalent to the limit on a 'for' loop for standard/serial C code
if (iGID < iNumElements)
for (int i=iT[iGID]; i<iT[iGID+1]; i++)
sum += T[i-1]*U[jT[i-1]-1];
} else return;
} SAXPY BLAS Level 1 AXPY is a combination of scalar
multiplication and vector addition
Performs y ← αx + y , where α is a scalar and y,x are
for (int i = 0; i < n; i++) do
y[i] = a*x[i] + y[i];
end for
__kernel void sAXPY(
__global const unsigned int N,
__global const float ALPHA,
__global const float* X,
__global float* Y){
//get index into global data array
unsigned int i = get_global_id(0);
if (i < N)
Y[i] = ALPHA*X[i] + Y[i];
Dot Product BLAS Level 1 DOT, also known as the scalar product, is an
operation which takes two vectors over the real numbers
and returns a real-valued scalar.
x.y = Σ xi yi = x1 y1 + x2 y2 + ... + xn yn for (int i = 0; i < n; i++) do
dot += x[i]*y[i];
end for
__kernel void sDOT(
__global const unsigned int N,
__global const float* X,
__global const float* Y,
__global float* DOT,
__local float* SharedData) {

//get global and local index
unsigned int i = get_global_id(0);
unsigned int tid = get_local_id(0);
SharedData[tid] = (i<N) ? X[i]*Y[i] : 0; //operates in local memory
//do reduction in local shared mem
for(int s=1; s < get_local_size(0); s *= 2) {
int index = 2 * s * tid;
if (index < get_local_size(0))
SharedData[index] += SharedData[index + s];
//first thread of each work-group writes on global memory
if (tid == 0) DOT[get_group_id(0)] = SharedData[0];
} Barrier works only for the work-items from the same work-group
Max work group size: 1024
Vector size = 30880?
a) another step to compute the final sum
problem: there's no SYNCHRONIZATION before kernel ends
b) Host makes the final reduction
30880=~30 adds
problem: You can't make a single kernel for the entire loop Dot Product (Atomic Functions) #pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable __kernel void sDOT(
__global const unsigned int N,
__global const float* X,
__global const float* Y,
__global float* DOT,
__global int* FLAG,
__local float* sdata) {
// get index into global data array
unsigned int tid = get_local_id(0);
unsigned int i = get_global_id(0);

sdata[tid] = (i<N) ? X[i]*Y[i] : 0;

if (i==0) {


// do reduction in shared mem
for(unsigned int s=1; s < get_local_size(0); s *= 2)
int index = 2 * s * tid;

if (index < get_local_size(0))
sdata[index] += sdata[index + s];


// write result for this block to global mem
if (tid == 0) {
while (atom_cmpxchg(FLAG,0,1)==1);
DOT[0] += sdata[0];

3x less speed
no performance Device Info:
Extensions: cl_khr_icd cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics
cl_khr_int64_extended_atomics cl_khr_byte_addressable_store Gaspard2 Purpose An environment that allows to implement the MDE (Model Driven Engineering) for RTES(Real-Time Embedded Systems) conception.
model reuse
human-friendly models
promotes parallelism to help reduce the power consumption and improving performance.
different target platforms
opensource and free (DaRT Team/INRIA Lille)
Gaspard2 Eclipse Environment Gaspard2 MDE/MDA Model Driven Architecture (MDA)
Unified Modeling Language (UML)
Meta Object Facility (MOF)
XML Metadata Interchange (XMI)
Modeling and Analysis of Real-time and Embedded systems (MARTE)
Query / View / Transformation (QVT) Gaspard2 and MARTE Array-OL: Array Oriented Language
Specification language for signal processing applications
Based on a multidimensional arrays
Allows to express the whole potential parallelism of applications Potential Parallelism Expression of Application
task parallelism
data parallelism APPLICATION MODELING Scheduling Task Graph: v1//v2, k, r1
Kernel Graph: op1, op2 Gaspard2 Environment Application Papyrus UML/Eclipse
MARTE Profile (OMG)
ArrayOL Concepts (tiler, reshape, etc) Tiler Specification
origin: {0}
paving: {{1}}
fitting: {{}} Modeling in UML + MARTE Gaspard2 Task Parallelism Task Graph:
blocks: tasks
connectors: data dependency Gaspard2 Data Parallelism Task Repetition
pattern producers and consumers (0,0) (1,0) (0,1) ARCHITECTURE MODELING GPU Architecture Tagged values: host and device
Communication Bus: allows to know how to transfer data
Memory Hierarchy: tagged values (private, local, constant, global) ASSOCIATION DEPLOYMENT At this moment, the user should explicit:
where and how the tasks will be executed
where and how the ports (memory allocations) will be placed
Constant, Global, Local (Shared) or Private??? At this moment, the user should indicate:
IP (Intelectual Property): library, piece of code, binary
OpenCL: kernel functions
Each ELEMENTARY TASK has an IP Elementary Task /********************
Gaspard2 MDE
IP name: MultInstance
function name: mult
Parameters: a,b,c
//OpenCL Functions
void mult(
__global const float* a,
__global const float* b,
__global float* c)
} Transformation Chain Transformation Library Gaspard2 Transformation Engine Scheduling Memory Mapping Task Graph Tiler to Task Polyhedron Loop GPUAPI Functional Model Load
(uml file) UML to MARTE Tiler to Task Local and Global Task Graph Scheduling Instances of Ports Memory Mapping GPUAPI Model to Text (Acceleo) Code Generated
.cl, .cpp, Makefile Gaspard2 provides a transformation chain tool to execute different transformations.

At last, with a template Acceleo the engine generates the code files. mapping Allocate::toKernel() : gpuapi::GPUAPI::Kernel {
init {
var kern := self.source->selectOne(true).oclAsType(AssemblyPart); //mapped to GPU
var ntasks := kern.type[StructuredComponent].ownedProperties[AssemblyPart]
->select(type[LinkTopologyTask]->isEmpty()).shape->shapeprod()->maxshape(); //threads count
name := kern.name;
assemblyPart := kern;
launchtopology := object gpuapi::GPUAPI::LaunchTopology{ //topology x,y,z = (int)total/256,256,0
dim := (ntasks/256 + 0.5).round();
dim += 256;
dim += 0;

// create functions (not tiler task) for each kernel
functions += kern.type[StructuredComponent].ownedProperties[AssemblyPart]->select(type[TilerTask]->isEmpty())
->map toIPFunctions(gpuapi::GPUAPI::placingOptions::Device);
variables += kern.ports[PortPart]->map FuncVars(gpuapi::GPUAPI::placingOptions::Device);
} QVTO (Query/View/Transformation Operational) : bit of code CODE GENERATED The transformation chain is hidden from the designer
Click and Generate
This example shows an OpenCL code, but the MetaModel GPUAPI allows to generate CUDA as well.
changes in Acceleo Model Templates Final Overview GPUs are good for linear algebra (low cost, good performance)
Right now, Gaspard2 is able to generate VHDL, OpenMP C and Fortran, SystemC, Synchronous
My work concentrates on the new chain, the OpenCL chain
The OpenCL chain beta version is almost ready
Future compiler versions will provide optimization levels: Memory copies
Work-Groups and Work-Items Dimension
Memory Access Coalescing Electric Machines Simulation
Maxwell's equations
linear systems
We need a linear solver (iterative, direct, etc.) Remarks Loop Control
switching between host and device
data remain in GPU memory Writing code
if we don't need to write the code...
a tool to generate OpenCL Remarks process all the tilers in the model
generates tasks and functions based on the Task Graph
assigns one possible scheduling discipline create Memory Mappings
create data allocations based on flowports searches associated tasks
defines kernels, variables, variable links (copyHostToDevice)
launch topology
Full transcript