
__kernel void SpMV(
    __global const unsigned int* iT,  // CSR row pointers (1-based)
    __global const unsigned int* jT,  // CSR column indices (1-based)
    __global const float* T,          // nonzero values
    __global const float* U,          // input vector
    __global float* V,                // output vector
    const unsigned int iNumElements)  // number of rows
{
    // get index into global data array
    int iGID = get_global_id(0);
    float sum = 0;
    // bound check (equivalent to the limit of a 'for' loop in standard/serial C code)
    if (iGID < iNumElements) {
        // the arrays use 1-based (Fortran-style) CSR indexing, hence the -1 offsets
        for (int i = iT[iGID]; i < iT[iGID+1]; i++)
            sum += T[i-1] * U[jT[i-1]-1];
        V[iGID] = sum;
    }
}
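For reference, a minimal sketch of the 1-based CSR layout this kernel expects (the tiny matrix and array contents below are illustrative, not from the slides); it explains the -1 offsets in the loop:

/* Illustrative 3x3 sparse matrix in 1-based CSR form:
 *     | 4 0 1 |
 * A = | 0 2 0 |
 *     | 1 0 3 |
 */
unsigned int iT[] = {1, 3, 4, 6};     /* row pointers: row r spans i = iT[r]..iT[r+1]-1 */
unsigned int jT[] = {1, 3, 2, 1, 3};  /* 1-based column index of each nonzero */
float        T[]  = {4, 1, 2, 1, 3};  /* nonzero values, row by row */
/* Row 0: i runs 1..2, reading T[i-1] = {4, 1} at columns jT[i-1]-1 = {0, 2}. */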

__kernel void sAXPY(
    const unsigned int N,
    const float ALPHA,
    __global const float* X,
    __global float* Y)
{
    // get index into global data array
    unsigned int i = get_global_id(0);
    if (i < N)
        Y[i] = ALPHA*X[i] + Y[i];
}
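For context, a hedged sketch of how the host would pass these arguments; the API calls are standard OpenCL 1.0, but the variable names (kernel, queue, xBuf, yBuf) and the scalar value are illustrative assumptions:

/* Hypothetical host-side setup for sAXPY: scalars are passed by value,
   buffers as cl_mem handles (created earlier with the usual clCreate* calls). */
cl_uint n = 30880;       /* vector length used in the slides */
cl_float alpha = 2.0f;   /* illustrative scalar */
clSetKernelArg(kernel, 0, sizeof(cl_uint),  &n);
clSetKernelArg(kernel, 1, sizeof(cl_float), &alpha);
clSetKernelArg(kernel, 2, sizeof(cl_mem),   &xBuf);
clSetKernelArg(kernel, 3, sizeof(cl_mem),   &yBuf);
size_t local = 256;
size_t global = ((n + local - 1) / local) * local;  /* round up to a multiple of 256 */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);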

Device Info:

Extensions: cl_khr_icd cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics
cl_khr_int64_extended_atomics cl_khr_byte_addressable_store

__kernel void sDOT(
    const unsigned int N,
    __global const float* X,
    __global const float* Y,
    __global float* DOT,
    __global int* FLAG,
    __local float* sdata)
{
    // get index into global data array
    unsigned int tid = get_local_id(0);
    unsigned int i = get_global_id(0);
    sdata[tid] = (i < N) ? X[i]*Y[i] : 0;
    // first work-item clears the accumulator and the lock
    if (i == 0) {
        DOT[0] = 0;
        *FLAG = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // do reduction in shared mem
    for (unsigned int s = 1; s < get_local_size(0); s *= 2) {
        int index = 2 * s * tid;
        if (index < get_local_size(0))
            sdata[index] += sdata[index + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // write result for this block to global mem,
    // serialized by a spin lock built on atom_cmpxchg
    if (tid == 0) {
        while (atom_cmpxchg(FLAG, 0, 1) == 1);  // acquire lock
        DOT[0] += sdata[0];
        atom_cmpxchg(FLAG, 1, 0);               // release lock
    }
}
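One detail worth noting: atom_cmpxchg on a __global int is an OpenCL 1.0 extension, so this kernel also needs the corresponding pragma (the extension appears in the Device Info list above):

// required for atom_cmpxchg on a __global 32-bit int under OpenCL 1.0
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable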

/********************
Gaspard2 MDE
IP name: MultInstance
function name: mult
Parameters: a, b, c
*********************/

// OpenCL Functions
void mult(
    __global const float* a,
    __global const float* b,
    __global float* c)
{
    c[0] = a[0]*b[0];
}

Outline

  • Part 1: What's GPU? (GPU/OpenCL/Linear Algebra, 1980/1990)
  • Part 2: Gaspard2 Environment (Gaspard2 and MARTE, Gaspard2 and OpenCL, Modeling in UML + MARTE)
  • Purpose
  • Environment
  • MDE
  • Computation Model: ArrayOL
  • Gaspard2 and OpenCL Example

Matrix Vector Multiplication

  • Sparse, Square, Symmetric matrix in CSR format
  • ~449,000 elements, N=30880

Wendell Rodrigues, PhD Student

DaRT Team - USTL - LIFL - INRIA Lille

M. Jean-Luc Dekeyser and M. Frederic Guyomarc'h

Array-OL: Array Oriented Language

  • Specification language for signal processing applications
  • Based on multidimensional arrays
  • Expresses the whole potential parallelism of applications

CL Code:

Region Nord-Pas de Calais - Valeo - GPUTech

Potential Parallelism Expression of Application

  • task parallelism
  • data parallelism

Vector Processor: Cray

Connection Machines: CM-5

Gaspard2 Environment

read instruction and decode it

fetch these 10 numbers

fetch those 10 numbers

fetch another 10 numbers

add and multiply them

put the results here

It is a "massively parallel" hypercubic arrangement of thousands of microprocessors, each with its own 4 kbits of RAM, which together executed in a SIMD fashion

Gaspard2 Purpose

Final Overview

http://gpumodeling.blogspot.com

Gaspard2 Task Parallelism

APPLICATION MODELING

ARCHITECTURE MODELING

GPU: Graphics Processing Unit

GPU Architecture

  • Current GPUs are both vector machines and massively parallel
  • More than SIMD, they are SPMD
  • But they are still always a co-processor
  • Programming Languages: Shader Languages (Sh, Cg, etc.); Brook+ (CAL); CUDA; OpenCL (NVidia and AMD)

Task Graph:

  • blocks: tasks
  • connectors: data dependency
  • Tagged values: host and device
  • Communication Bus: indicates how data is to be transferred
  • Memory Hierarchy: tagged values (private, local, constant, global)

SAXPY

An environment for applying MDE (Model-Driven Engineering) to the design of RTES (Real-Time Embedded Systems).

  • model reuse
  • human-friendly models
  • promotes parallelism to help reduce power consumption and improve performance
  • different target platforms
  • open source and free (DaRT Team/INRIA Lille)
  • GPUs are good for linear algebra (low cost, good performance)
  • Right now, Gaspard2 is able to generate VHDL, OpenMP C and Fortran, SystemC, and synchronous languages
  • My work concentrates on the new chain, the OpenCL chain
  • The OpenCL chain beta version is almost ready

OpenCL (Open Computing Language)

Scheduling

Application

Apple, NVidia, AMD, Intel, IBM, RapidMind, Electronic Arts (EA), 3DLABS, Activision Blizzard, ARM, Barco, Broadcom, Codeplay, Ericsson, Freescale, HI, Imagination Technologies, Kestrel Institute, Motorola, Movidia, Nokia, QNX, Samsung, Seaweed, Takumi, TI and Umeå University

Task Graph: v1//v2, k, r1

Kernel Graph: op1, op2

BLAS Level 1 AXPY is a combination of scalar multiplication and vector addition.

Performs y ← αx + y, where α is a scalar and x, y are vectors.

  • Papyrus UML/Eclipse
  • MARTE Profile (OMG)
  • ArrayOL Concepts (tiler, reshape, etc)

Conjugate Gradient

  • An open standard managed by the Khronos Group

– Influenced & guided by Apple

– Spec 1.0 approved Dec’08

  • A system for executing short "Enhanced C" routines (kernels) across devices

– All around Heterogeneous Platforms – Host & Devices

– Devices: CPU, GPU, Accelerator (FPGA)

  • Skewed towards GPU HW

– Samples, Vector Types, etc.

  • Offers hybrid execution capability

Tiler Specification

  • origin: {0}
  • paving: {{1}}
  • fitting: {{}}
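To make these fields concrete, here is a minimal sketch (plain C, illustrative names, not from the slides) of the ArrayOL tiler index computation: the reference element of repetition r sits at origin + paving·r, and element i of the pattern lies at fitting·i from that reference; with the empty fitting above, each repetition reads a single scalar.

/* Minimal 1-D sketch of a tiler with origin {0}, paving {{1}}, fitting {{}}:
   repetition r reads array[origin + paving*r]; an empty fitting means the
   pattern is a single scalar, so no inner fitting loop is needed. */
int tiler_ref(int origin, int paving, int r)
{
    return origin + paving * r;   /* = r for origin 0, paving 1 */
}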

Future compiler versions will provide optimization levels:

  • CPU controls the main loop and scalar operations
  • Many kernel launches in the main loop

for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];

Memory copies

Work-Groups and Work-Items Dimension

Memory Access Coalescing

Gaspard2 Eclipse Environment

~2 Gflops

Gaspard2 Data Parallelism

Task Repetition

  • pattern producers and consumers

Refactored GPGPU


NVidia's CUDA was a disruptive technology

  • lets developers write C on the GPU
  • extends GPU use to non-traditional applications
  • provides a synchronization mechanism

OpenCL deepens and extends the revolution

ASSOCIATION



At this point, the user must make explicit:

  • where and how the tasks will be executed
  • where and how the ports (memory allocations) will be placed

Constant, Global, Local (Shared) or Private???

Implementation Details

Dot Product

BLAS Level 1 DOT, also known as the scalar product, is an operation which takes two vectors over the real numbers and returns a real-valued scalar.

x·y = Σ xᵢyᵢ = x₁y₁ + x₂y₂ + … + xₙyₙ

Simple CPU Sequential C Version

Simple CPU Parallel OCL Version

  • SpMV
  • AXPY
  • DotProduct
  • Scalar and Loop Control

DEPLOYMENT

At this point, the user should indicate:

  • IP (Intellectual Property): library, piece of code, binary

OpenCL: kernel functions

  • Each ELEMENTARY TASK has an IP

void sumvec(float *A, float *B, float *C)
{
    // N: vector length, defined elsewhere
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

__kernel void sumvec(__global const float *A,
                     __global const float *B,
                     __global float *C)
{
    int gid = get_global_id(0);
    C[gid] = A[gid] + B[gid];
}

Elementary Task

__kernel void sDOT(
    const unsigned int N,
    __global const float* X,
    __global const float* Y,
    __global float* DOT,
    __local float* SharedData)
{
    // get global and local index
    unsigned int i = get_global_id(0);
    unsigned int tid = get_local_id(0);
    SharedData[tid] = (i < N) ? X[i]*Y[i] : 0;  // operates in local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    // do reduction in local shared mem
    for (int s = 1; s < get_local_size(0); s *= 2) {
        int index = 2 * s * tid;
        if (index < get_local_size(0))
            SharedData[index] += SharedData[index + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // first work-item of each work-group writes its partial sum to global memory
    if (tid == 0) DOT[get_group_id(0)] = SharedData[0];
}

for (int i = 0; i < n; i++)
    dot += x[i]*y[i];

Transformation Chain

  • Barrier works only for the work-items of the same work-group
  • Max work-group size: 1024
  • Vector size = 30880?
  • Solutions?

a) another step to compute the final sum

problem: there is no global SYNCHRONIZATION before the kernel ends

b) the host makes the final reduction

30880 / 1024 ≈ 30 partial sums to add

problem: you can't make a single kernel for the entire loop
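A minimal sketch of solution (b), with illustrative names (queue, dotBuf, numGroups are assumptions, not from the slides); the blocking read also provides the global synchronization the kernel lacks:

#include <CL/cl.h>
#include <stdlib.h>

/* Hypothetical host-side finish for sDOT: each work-group wrote one
   partial sum into dotBuf; read them back and add them on the CPU. */
float finish_dot(cl_command_queue queue, cl_mem dotBuf, size_t numGroups)
{
    float *partial = (float *)malloc(numGroups * sizeof(float));
    float sum = 0.0f;
    /* blocking read: waits for the kernel, i.e. the missing global sync */
    clEnqueueReadBuffer(queue, dotBuf, CL_TRUE, 0,
                        numGroups * sizeof(float), partial, 0, NULL, NULL);
    for (size_t g = 0; g < numGroups; g++)
        sum += partial[g];
    free(partial);
    return sum;
}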

Gaspard2 Transformation Engine

Model Load

(uml file)

Transformation Library

Gaspard2 MDE/MDA

UML to MARTE

Memory Mapping

Scheduling

Task Graph

Tiler to Task

Instances of Ports

Polyhedron

Loop

GPUAPI

Functional

Tiler to Task

  • processes all the tilers in the model
  • generates tasks and functions

Local and Global Task Graph

Scheduling

  • based on the Task Graph
  • assigns one possible scheduling discipline

QVTO (Query/View/Transformation Operational): a piece of code

Memory Mapping

  • creates memory mappings
  • creates data allocations based on flow ports

OpenCL: Work-Groups and Work-items

Gaspard2 provides a transformation chain tool to execute different transformations.

Finally, with an Acceleo template, the engine generates the code files.

mapping Allocate::toKernel() : gpuapi::GPUAPI::Kernel {
    init {
        var kern := self.source->selectOne(true).oclAsType(AssemblyPart); // mapped to GPU
        var ntasks := kern.type[StructuredComponent].ownedProperties[AssemblyPart]
            ->select(type[LinkTopologyTask]->isEmpty()).shape->shapeprod()->maxshape(); // threads count
    }
    name := kern.name;
    assemblyPart := kern;
    launchtopology := object gpuapi::GPUAPI::LaunchTopology { // topology x,y,z = (int)total/256, 256, 0
        dim := (ntasks/256 + 0.5).round();
        dim += 256;
        dim += 0;
    };
    // create functions (not tiler tasks) for each kernel
    functions += kern.type[StructuredComponent].ownedProperties[AssemblyPart]
        ->select(type[TilerTask]->isEmpty())
        ->map toIPFunctions(gpuapi::GPUAPI::placingOptions::Device);
    // vars
    variables += kern.ports[PortPart]->map FuncVars(gpuapi::GPUAPI::placingOptions::Device);
}

GPUAPI

  • searches for associated tasks
  • defines kernels, variables, and variable links (copyHostToDevice)
  • defines the launch topology

Model to Text (Acceleo)

Dot Product (Atomic Functions)

Code Generated

/src/

.cl, .cpp, Makefile

#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable

Model Driven Architecture (MDA)

Unified Modeling Language (UML)

Meta Object Facility (MOF)

XML Metadata Interchange (XMI)

Modeling and Analysis of Real-time and Embedded systems (MARTE)

Query / View / Transformation (QVT)

Modeling Example (Array Product)

Result:

  • 3x slower
  • no performance gain

CODE GENERATED

Application Model

Architecture Model

Association Model

Deployment Model

QVT Transformations and Metamodels

(add, delete and transform elements)

Acceleo Model to Text

Linear Algebra and OpenCL

Electric Machines Simulation

  • Maxwell's equations
  • linear systems
  • We need a linear solver (iterative, direct, etc.)

Remarks

The transformation chain is hidden from the designer

  • Click and Generate

This example shows OpenCL code, but the GPUAPI metamodel makes it possible to generate CUDA as well.

  • changes in Acceleo Model Templates
  • There are no BLAS, FFT, etc. libraries for OpenCL (not yet!)
  • Write the functions you need yourself
  • Our problem: Conjugate Gradient with Sparse Matrices (order of 500k or more elements)

Implementing...

Loop Control

  • switching between host and device
  • data remains in GPU memory (see the sketch below)
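A rough sketch of this pattern (the host API calls are standard OpenCL, but the kernel and buffer names are illustrative assumptions): the host drives the conjugate-gradient loop, re-launching kernels each iteration while the vectors stay resident in device buffers; only the scalar needed for loop control is copied back.

#include <CL/cl.h>

/* Hypothetical loop control: 'queue', 'spmv', 'axpy', 'sdot', 'dotBuf',
   'global', 'local' would be set up by the generated host code. */
void cg_loop(cl_command_queue queue, cl_kernel spmv, cl_kernel axpy,
             cl_kernel sdot, cl_mem dotBuf, size_t global, size_t local,
             int maxIter, float eps)
{
    float residual = eps + 1.0f;
    for (int k = 0; k < maxIter && residual > eps; k++) {
        clEnqueueNDRangeKernel(queue, spmv, 1, NULL, &global, &local, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, axpy, 1, NULL, &global, &local, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, sdot, 1, NULL, &global, &local, 0, NULL, NULL);
        /* blocking read of one scalar is the only per-iteration copy */
        clEnqueueReadBuffer(queue, dotBuf, CL_TRUE, 0, sizeof(float),
                            &residual, 0, NULL, NULL);
    }
}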

Remarks

Writing code

  • if we don't need to write the code...
  • a tool to generate OpenCL