### Present Remotely

Send the link below via email or IM

CopyPresent to your audience

Start remote presentation- Invited audience members
**will follow you**as you navigate and present - People invited to a presentation
**do not need a Prezi account** - This link expires
**10 minutes**after you close the presentation - A maximum of
**30 users**can follow your presentation - Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

### Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.

You can change this under Settings & Account at any time.

# Gaspard2 and OpenCL

Presentation on Paris VI at 29 avril

by

Tweet## Wendell Rodrigues

on 11 May 2010#### Transcript of Gaspard2 and OpenCL

Gaspard2 and OpenCL Wendell Rodrigues, PhD Student

DaRT Team - USTL -LIFL - INRIA Lille

M. Jean-Luc Dekeyser and M.Frederic Guyomarc'h Region Nord-Pas de Calais - Valeo - GPUTech Part1: What's GPU? Vector Processor: Cray read instruction and decode it

fetch these 10 numbers

fetch those 10 numbers

fetch another 10 numbers

add and multiply them

put the results here 1980/1990 Connection Machines: CM-5 It is a "massively parallel" hypercubic arrangement of thousands of microprocessors, each with its own 4 kbits of RAM, which together executed in a SIMD fashion GPU: Graphics Processing Unit Current GPUs are both a vector machine and massively parallel

More than SIMD, they are SPMD

But they're always a co-processor

Programming Languages: Shader Languages (Sh, Cg, etc.); Brook+(CAL); CUDA, OpenCL (NVidia and AMD) OpenCL (Computing Language) An opened Standard managed by the Khronos group

– Influenced & guided by Apple

– Spec 1.0 approved Dec’08

A system for executing short "Enhanced C" routines (kernels) across devices

– All around Heterogeneous Platforms – Host & Devices

– Devices: CPU, GPU, Accelerator (FPGA)

Skewed towards GPU HW

– Samples, Vector Types, etc.

Offers hybrid execution capability Apple, NVidia, AMD, Intel, IBM, RapidMind, Electronic Arts

(EA), 3DLABS, Activision Blizzard, ARM, Barco, Broadcom,

Codeplay, Ericsson, Freescale, HI, Imagination Technologies,

Kestrel Institute, Motorola, Movidia, Nokia, QNX, Samsung,

Seaweed, Takumi, TI and Umeå University http://gpumodeling.blogspot.com NVidia's CUDA was a disruptive technology

writes C on the GPU

extends to non-traditional usages

provides synchronization mechanism

OpenCL deepens and extends the revolution Refactored GPGPU void sumvec(float *A, float *B,

float *C)

{

for (i=0;i<N;i++)

C[i] = A[i] + B[i];

} __kernel void sumvec(__global const float *A,

__global const float *B,

__global float *C)

{

int gid = get_global_id(0);

C[gid] = A[gid] + B[gid];

} Simple CPU Sequential C Version Simple CPU Parallel OCL Version OpenCL: Work-Groups and Work-items Linear Algebra and OpenCL There are no BLAS, FFT, etc. libraries for OpenCL (not yet!)

Make yourself the functions you need

Our problem: Conjugate Gradient with Sparse Matrices (order of 500k or more elements)

Implementing...

Conjugate Gradient Outline GPU/OpenCL/Linear Algebra Gaspard2 Environment CPU controls the main loop and scalar operations

Many kernel launches in the main loop ~2 Gflops Implementation Details SpMV

AXPY

DotProduct

Scalar and Loop Control Part 2 - Purpose

Environment

MDE

Computation Model: ArrayOL

Gaspard2 and OpenCL Example Modeling Example (Array Product) Application Model

Architecture Model

Association Model

Deployment Model

QVT Transformations and Metamodels

(add, delete and transform elements)

Acceleo Model to Text Matrix Vector Multiplication Sparse, Square, Symmetric, CSR format

~449.000 elements, N=30880

CSR format CL Code: __kernel void SpMV(

__global const unsigned int* iT,

__global const unsigned int* jT,

__global const float* T,

__global const float* U,

__global float* V,

__global int iNumElements)

{

// get index into global data array

int iGID = get_global_id(0);

float sum=0;

// bound check (equivalent to the limit on a 'for' loop for standard/serial C code

if (iGID < iNumElements)

{

for (int i=iT[iGID]; i<iT[iGID+1]; i++)

sum += T[i-1]*U[jT[i-1]-1];

} else return;

V[iGID]=sum;

} SAXPY BLAS Level 1 AXPY is a combination of scalar

multiplication and vector addition

Performs y ← αx + y , where α is a scalar and y,x are

vectors

for (int i = 0; i < n; i++) do

y[i] = a*x[i] + y[i];

end for

__kernel void sAXPY(

__global const unsigned int N,

__global const float ALPHA,

__global const float* X,

__global float* Y){

//get index into global data array

unsigned int i = get_global_id(0);

if (i < N)

Y[i] = ALPHA*X[i] + Y[i];

}

Dot Product BLAS Level 1 DOT, also known as the scalar product, is an

operation which takes two vectors over the real numbers

and returns a real-valued scalar.

x.y = Σ xi yi = x1 y1 + x2 y2 + ... + xn yn for (int i = 0; i < n; i++) do

dot += x[i]*y[i];

end for

__kernel void sDOT(

__global const unsigned int N,

__global const float* X,

__global const float* Y,

__global float* DOT,

__local float* SharedData) {

//get global and local index

unsigned int i = get_global_id(0);

unsigned int tid = get_local_id(0);

SharedData[tid] = (i<N) ? X[i]*Y[i] : 0; //operates in local memory

barrier(CLK_LOCAL_MEM_FENCE);

//do reduction in local shared mem

for(int s=1; s < get_local_size(0); s *= 2) {

int index = 2 * s * tid;

if (index < get_local_size(0))

SharedData[index] += SharedData[index + s];

barrier(CLK_LOCAL_MEM_FENCE);

}

//first thread of each work-group writes on global memory

if (tid == 0) DOT[get_group_id(0)] = SharedData[0];

} Barrier works only for the work-items from the same work-group

Max work group size: 1024

Vector size = 30880?

Solutions?

a) another step to compute the final sum

problem: there's no SYNCHRONIZATION before kernel ends

b) Host makes the final reduction

30880=~30 adds

problem: You can't make a single kernel for the entire loop Dot Product (Atomic Functions) #pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable

#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable __kernel void sDOT(

__global const unsigned int N,

__global const float* X,

__global const float* Y,

__global float* DOT,

__global int* FLAG,

__local float* sdata) {

// get index into global data array

unsigned int tid = get_local_id(0);

unsigned int i = get_global_id(0);

sdata[tid] = (i<N) ? X[i]*Y[i] : 0;

if (i==0) {

DOT[0]=0;

*FLAG=0;

}

barrier(CLK_LOCAL_MEM_FENCE);

// do reduction in shared mem

for(unsigned int s=1; s < get_local_size(0); s *= 2)

{

int index = 2 * s * tid;

if (index < get_local_size(0))

{

sdata[index] += sdata[index + s];

}

barrier(CLK_LOCAL_MEM_FENCE);

}

// write result for this block to global mem

if (tid == 0) {

while (atom_cmpxchg(FLAG,0,1)==1);

DOT[0] += sdata[0];

atom_cmpxchg(FLAG,1,0);

}

}

Result:

3x less speed

no performance Device Info:

Extensions: cl_khr_icd cl_khr_global_int32_base_atomics

cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics

cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics

cl_khr_int64_extended_atomics cl_khr_byte_addressable_store Gaspard2 Purpose An environment that allows to implement the MDE (Model Driven Engineering) for RTES(Real-Time Embedded Systems) conception.

model reuse

human-friendly models

promotes parallelism to help reduce the power consumption and improving performance.

different target platforms

opensource and free (DaRT Team/INRIA Lille)

Gaspard2 Eclipse Environment Gaspard2 MDE/MDA Model Driven Architecture (MDA)

Unified Modeling Language (UML)

Meta Object Facility (MOF)

XML Metadata Interchange (XMI)

Modeling and Analysis of Real-time and Embedded systems (MARTE)

Query / View / Transformation (QVT) Gaspard2 and MARTE Array-OL: Array Oriented Language

Specification language for signal processing applications

Based on a multidimensional arrays

Allows to express the whole potential parallelism of applications Potential Parallelism Expression of Application

task parallelism

data parallelism APPLICATION MODELING Scheduling Task Graph: v1//v2, k, r1

Kernel Graph: op1, op2 Gaspard2 Environment Application Papyrus UML/Eclipse

MARTE Profile (OMG)

ArrayOL Concepts (tiler, reshape, etc) Tiler Specification

origin: {0}

paving: {{1}}

fitting: {{}} Modeling in UML + MARTE Gaspard2 Task Parallelism Task Graph:

blocks: tasks

connectors: data dependency Gaspard2 Data Parallelism Task Repetition

pattern producers and consumers (0,0) (1,0) (0,1) ARCHITECTURE MODELING GPU Architecture Tagged values: host and device

Communication Bus: allows to know how to transfer data

Memory Hierarchy: tagged values (private, local, constant, global) ASSOCIATION DEPLOYMENT At this moment, the user should explicit:

where and how the tasks will be executed

where and how the ports (memory allocations) will be placed

Constant, Global, Local (Shared) or Private??? At this moment, the user should indicate:

IP (Intelectual Property): library, piece of code, binary

OpenCL: kernel functions

Each ELEMENTARY TASK has an IP Elementary Task /********************

Gaspard2 MDE

IP name: MultInstance

function name: mult

Parameters: a,b,c

*********************/

//OpenCL Functions

void mult(

__global const float* a,

__global const float* b,

__global float* c)

{

c[0]=a[0]*b[0];

} Transformation Chain Transformation Library Gaspard2 Transformation Engine Scheduling Memory Mapping Task Graph Tiler to Task Polyhedron Loop GPUAPI Functional Model Load

(uml file) UML to MARTE Tiler to Task Local and Global Task Graph Scheduling Instances of Ports Memory Mapping GPUAPI Model to Text (Acceleo) Code Generated

/src/

.cl, .cpp, Makefile Gaspard2 provides a transformation chain tool to execute different transformations.

At last, with a template Acceleo the engine generates the code files. mapping Allocate::toKernel() : gpuapi::GPUAPI::Kernel {

init {

var kern := self.source->selectOne(true).oclAsType(AssemblyPart); //mapped to GPU

var ntasks := kern.type[StructuredComponent].ownedProperties[AssemblyPart]

->select(type[LinkTopologyTask]->isEmpty()).shape->shapeprod()->maxshape(); //threads count

}

name := kern.name;

assemblyPart := kern;

launchtopology := object gpuapi::GPUAPI::LaunchTopology{ //topology x,y,z = (int)total/256,256,0

dim := (ntasks/256 + 0.5).round();

dim += 256;

dim += 0;

};

// create functions (not tiler task) for each kernel

functions += kern.type[StructuredComponent].ownedProperties[AssemblyPart]->select(type[TilerTask]->isEmpty())

->map toIPFunctions(gpuapi::GPUAPI::placingOptions::Device);

//vars

variables += kern.ports[PortPart]->map FuncVars(gpuapi::GPUAPI::placingOptions::Device);

} QVTO (Query/View/Transformation Operational) : bit of code CODE GENERATED The transformation chain is hidden from the designer

Click and Generate

This example shows an OpenCL code, but the MetaModel GPUAPI allows to generate CUDA as well.

changes in Acceleo Model Templates Final Overview GPUs are good for linear algebra (low cost, good performance)

Right now, Gaspard2 is able to generate VHDL, OpenMP C and Fortran, SystemC, Synchronous

My work concentrates on the new chain, the OpenCL chain

The OpenCL chain beta version is almost ready

Future compiler versions will provide optimization levels: Memory copies

Work-Groups and Work-Items Dimension

Memory Access Coalescing Electric Machines Simulation

Maxwell's equations

linear systems

We need a linear solver (iterative, direct, etc.) Remarks Loop Control

switching between host and device

data remain in GPU memory Writing code

if we don't need to write the code...

a tool to generate OpenCL Remarks process all the tilers in the model

generates tasks and functions based on the Task Graph

assigns one possible scheduling discipline create Memory Mappings

create data allocations based on flowports searches associated tasks

defines kernels, variables, variable links (copyHostToDevice)

launch topology

Full transcriptDaRT Team - USTL -LIFL - INRIA Lille

M. Jean-Luc Dekeyser and M.Frederic Guyomarc'h Region Nord-Pas de Calais - Valeo - GPUTech Part1: What's GPU? Vector Processor: Cray read instruction and decode it

fetch these 10 numbers

fetch those 10 numbers

fetch another 10 numbers

add and multiply them

put the results here 1980/1990 Connection Machines: CM-5 It is a "massively parallel" hypercubic arrangement of thousands of microprocessors, each with its own 4 kbits of RAM, which together executed in a SIMD fashion GPU: Graphics Processing Unit Current GPUs are both a vector machine and massively parallel

More than SIMD, they are SPMD

But they're always a co-processor

Programming Languages: Shader Languages (Sh, Cg, etc.); Brook+(CAL); CUDA, OpenCL (NVidia and AMD) OpenCL (Computing Language) An opened Standard managed by the Khronos group

– Influenced & guided by Apple

– Spec 1.0 approved Dec’08

A system for executing short "Enhanced C" routines (kernels) across devices

– All around Heterogeneous Platforms – Host & Devices

– Devices: CPU, GPU, Accelerator (FPGA)

Skewed towards GPU HW

– Samples, Vector Types, etc.

Offers hybrid execution capability Apple, NVidia, AMD, Intel, IBM, RapidMind, Electronic Arts

(EA), 3DLABS, Activision Blizzard, ARM, Barco, Broadcom,

Codeplay, Ericsson, Freescale, HI, Imagination Technologies,

Kestrel Institute, Motorola, Movidia, Nokia, QNX, Samsung,

Seaweed, Takumi, TI and Umeå University http://gpumodeling.blogspot.com NVidia's CUDA was a disruptive technology

writes C on the GPU

extends to non-traditional usages

provides synchronization mechanism

OpenCL deepens and extends the revolution Refactored GPGPU void sumvec(float *A, float *B,

float *C)

{

for (i=0;i<N;i++)

C[i] = A[i] + B[i];

} __kernel void sumvec(__global const float *A,

__global const float *B,

__global float *C)

{

int gid = get_global_id(0);

C[gid] = A[gid] + B[gid];

} Simple CPU Sequential C Version Simple CPU Parallel OCL Version OpenCL: Work-Groups and Work-items Linear Algebra and OpenCL There are no BLAS, FFT, etc. libraries for OpenCL (not yet!)

Make yourself the functions you need

Our problem: Conjugate Gradient with Sparse Matrices (order of 500k or more elements)

Implementing...

Conjugate Gradient Outline GPU/OpenCL/Linear Algebra Gaspard2 Environment CPU controls the main loop and scalar operations

Many kernel launches in the main loop ~2 Gflops Implementation Details SpMV

AXPY

DotProduct

Scalar and Loop Control Part 2 - Purpose

Environment

MDE

Computation Model: ArrayOL

Gaspard2 and OpenCL Example Modeling Example (Array Product) Application Model

Architecture Model

Association Model

Deployment Model

QVT Transformations and Metamodels

(add, delete and transform elements)

Acceleo Model to Text Matrix Vector Multiplication Sparse, Square, Symmetric, CSR format

~449.000 elements, N=30880

CSR format CL Code: __kernel void SpMV(

__global const unsigned int* iT,

__global const unsigned int* jT,

__global const float* T,

__global const float* U,

__global float* V,

__global int iNumElements)

{

// get index into global data array

int iGID = get_global_id(0);

float sum=0;

// bound check (equivalent to the limit on a 'for' loop for standard/serial C code

if (iGID < iNumElements)

{

for (int i=iT[iGID]; i<iT[iGID+1]; i++)

sum += T[i-1]*U[jT[i-1]-1];

} else return;

V[iGID]=sum;

} SAXPY BLAS Level 1 AXPY is a combination of scalar

multiplication and vector addition

Performs y ← αx + y , where α is a scalar and y,x are

vectors

for (int i = 0; i < n; i++) do

y[i] = a*x[i] + y[i];

end for

__kernel void sAXPY(

__global const unsigned int N,

__global const float ALPHA,

__global const float* X,

__global float* Y){

//get index into global data array

unsigned int i = get_global_id(0);

if (i < N)

Y[i] = ALPHA*X[i] + Y[i];

}

Dot Product BLAS Level 1 DOT, also known as the scalar product, is an

operation which takes two vectors over the real numbers

and returns a real-valued scalar.

x.y = Σ xi yi = x1 y1 + x2 y2 + ... + xn yn for (int i = 0; i < n; i++) do

dot += x[i]*y[i];

end for

__kernel void sDOT(

__global const unsigned int N,

__global const float* X,

__global const float* Y,

__global float* DOT,

__local float* SharedData) {

//get global and local index

unsigned int i = get_global_id(0);

unsigned int tid = get_local_id(0);

SharedData[tid] = (i<N) ? X[i]*Y[i] : 0; //operates in local memory

barrier(CLK_LOCAL_MEM_FENCE);

//do reduction in local shared mem

for(int s=1; s < get_local_size(0); s *= 2) {

int index = 2 * s * tid;

if (index < get_local_size(0))

SharedData[index] += SharedData[index + s];

barrier(CLK_LOCAL_MEM_FENCE);

}

//first thread of each work-group writes on global memory

if (tid == 0) DOT[get_group_id(0)] = SharedData[0];

} Barrier works only for the work-items from the same work-group

Max work group size: 1024

Vector size = 30880?

Solutions?

a) another step to compute the final sum

problem: there's no SYNCHRONIZATION before kernel ends

b) Host makes the final reduction

30880=~30 adds

problem: You can't make a single kernel for the entire loop Dot Product (Atomic Functions) #pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable

#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable __kernel void sDOT(

__global const unsigned int N,

__global const float* X,

__global const float* Y,

__global float* DOT,

__global int* FLAG,

__local float* sdata) {

// get index into global data array

unsigned int tid = get_local_id(0);

unsigned int i = get_global_id(0);

sdata[tid] = (i<N) ? X[i]*Y[i] : 0;

if (i==0) {

DOT[0]=0;

*FLAG=0;

}

barrier(CLK_LOCAL_MEM_FENCE);

// do reduction in shared mem

for(unsigned int s=1; s < get_local_size(0); s *= 2)

{

int index = 2 * s * tid;

if (index < get_local_size(0))

{

sdata[index] += sdata[index + s];

}

barrier(CLK_LOCAL_MEM_FENCE);

}

// write result for this block to global mem

if (tid == 0) {

while (atom_cmpxchg(FLAG,0,1)==1);

DOT[0] += sdata[0];

atom_cmpxchg(FLAG,1,0);

}

}

Result:

3x less speed

no performance Device Info:

Extensions: cl_khr_icd cl_khr_global_int32_base_atomics

cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics

cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics

cl_khr_int64_extended_atomics cl_khr_byte_addressable_store Gaspard2 Purpose An environment that allows to implement the MDE (Model Driven Engineering) for RTES(Real-Time Embedded Systems) conception.

model reuse

human-friendly models

promotes parallelism to help reduce the power consumption and improving performance.

different target platforms

opensource and free (DaRT Team/INRIA Lille)

Gaspard2 Eclipse Environment Gaspard2 MDE/MDA Model Driven Architecture (MDA)

Unified Modeling Language (UML)

Meta Object Facility (MOF)

XML Metadata Interchange (XMI)

Modeling and Analysis of Real-time and Embedded systems (MARTE)

Query / View / Transformation (QVT) Gaspard2 and MARTE Array-OL: Array Oriented Language

Specification language for signal processing applications

Based on a multidimensional arrays

Allows to express the whole potential parallelism of applications Potential Parallelism Expression of Application

task parallelism

data parallelism APPLICATION MODELING Scheduling Task Graph: v1//v2, k, r1

Kernel Graph: op1, op2 Gaspard2 Environment Application Papyrus UML/Eclipse

MARTE Profile (OMG)

ArrayOL Concepts (tiler, reshape, etc) Tiler Specification

origin: {0}

paving: {{1}}

fitting: {{}} Modeling in UML + MARTE Gaspard2 Task Parallelism Task Graph:

blocks: tasks

connectors: data dependency Gaspard2 Data Parallelism Task Repetition

pattern producers and consumers (0,0) (1,0) (0,1) ARCHITECTURE MODELING GPU Architecture Tagged values: host and device

Communication Bus: allows to know how to transfer data

Memory Hierarchy: tagged values (private, local, constant, global) ASSOCIATION DEPLOYMENT At this moment, the user should explicit:

where and how the tasks will be executed

where and how the ports (memory allocations) will be placed

Constant, Global, Local (Shared) or Private??? At this moment, the user should indicate:

IP (Intelectual Property): library, piece of code, binary

OpenCL: kernel functions

Each ELEMENTARY TASK has an IP Elementary Task /********************

Gaspard2 MDE

IP name: MultInstance

function name: mult

Parameters: a,b,c

*********************/

//OpenCL Functions

void mult(

__global const float* a,

__global const float* b,

__global float* c)

{

c[0]=a[0]*b[0];

} Transformation Chain Transformation Library Gaspard2 Transformation Engine Scheduling Memory Mapping Task Graph Tiler to Task Polyhedron Loop GPUAPI Functional Model Load

(uml file) UML to MARTE Tiler to Task Local and Global Task Graph Scheduling Instances of Ports Memory Mapping GPUAPI Model to Text (Acceleo) Code Generated

/src/

.cl, .cpp, Makefile Gaspard2 provides a transformation chain tool to execute different transformations.

At last, with a template Acceleo the engine generates the code files. mapping Allocate::toKernel() : gpuapi::GPUAPI::Kernel {

init {

var kern := self.source->selectOne(true).oclAsType(AssemblyPart); //mapped to GPU

var ntasks := kern.type[StructuredComponent].ownedProperties[AssemblyPart]

->select(type[LinkTopologyTask]->isEmpty()).shape->shapeprod()->maxshape(); //threads count

}

name := kern.name;

assemblyPart := kern;

launchtopology := object gpuapi::GPUAPI::LaunchTopology{ //topology x,y,z = (int)total/256,256,0

dim := (ntasks/256 + 0.5).round();

dim += 256;

dim += 0;

};

// create functions (not tiler task) for each kernel

functions += kern.type[StructuredComponent].ownedProperties[AssemblyPart]->select(type[TilerTask]->isEmpty())

->map toIPFunctions(gpuapi::GPUAPI::placingOptions::Device);

//vars

variables += kern.ports[PortPart]->map FuncVars(gpuapi::GPUAPI::placingOptions::Device);

} QVTO (Query/View/Transformation Operational) : bit of code CODE GENERATED The transformation chain is hidden from the designer

Click and Generate

This example shows an OpenCL code, but the MetaModel GPUAPI allows to generate CUDA as well.

changes in Acceleo Model Templates Final Overview GPUs are good for linear algebra (low cost, good performance)

Right now, Gaspard2 is able to generate VHDL, OpenMP C and Fortran, SystemC, Synchronous

My work concentrates on the new chain, the OpenCL chain

The OpenCL chain beta version is almost ready

Future compiler versions will provide optimization levels: Memory copies

Work-Groups and Work-Items Dimension

Memory Access Coalescing Electric Machines Simulation

Maxwell's equations

linear systems

We need a linear solver (iterative, direct, etc.) Remarks Loop Control

switching between host and device

data remain in GPU memory Writing code

if we don't need to write the code...

a tool to generate OpenCL Remarks process all the tilers in the model

generates tasks and functions based on the Task Graph

assigns one possible scheduling discipline create Memory Mappings

create data allocations based on flowports searches associated tasks

defines kernels, variables, variable links (copyHostToDevice)

launch topology