Mapping Array Operations to Specific Architectures

Updated Nov. 13, 2016

Goal

Current

The End

Conclusion

Backend Design Criteria

Question: Is it possible to construct a language agnostic backend for high-level languages without sacrificing performance?

CAPE: C-Targeting Array Processing Engine

Language agnostic

Support a programming model not a specific language

Programming Model

High-level
Declarative
Array-oriented

Language integration via intermediate representation

Efficient

Target performance comparable to straightforward hand-coded C/C++ for the same application

Question: Is it possible to construct a language agnostic backend for high-level languages without sacrificing performance?

Language Integration

Map abstractions
vector bytecode - intermediate representation

Internal Representation BhIr

Annotated vector bytecode

Transformations

Optimization
Normalization
Fusion, grouping bytecode sequences
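
As a rough illustration of what "grouping bytecode sequences" could look like, the sketch below places consecutive element-wise instructions over the same shape into one kernel. The `Instr` tuple, the opcode names, and the deliberately simple grouping rule are all hypothetical; this is not Bohrium's actual bytecode format or fusion algorithm.

```python
from collections import namedtuple

# Hypothetical bytecode instruction: opcode, output name, input names, array shape.
Instr = namedtuple("Instr", "op out ins shape")

# Hypothetical set of opcodes treated as element-wise.
ELEMENTWISE = {"ADD", "MUL", "SUB"}

def greedy_fuse(program):
    """Group consecutive element-wise instructions that share a shape."""
    groups = []
    for instr in program:
        last = groups[-1] if groups else None
        fusible = (
            last is not None
            and instr.op in ELEMENTWISE
            and last[-1].op in ELEMENTWISE
            and last[-1].shape == instr.shape
        )
        if fusible:
            last.append(instr)      # extend the current kernel
        else:
            groups.append([instr])  # start a new kernel
    return groups

program = [
    Instr("ADD", "t0", ("a", "b"), (1000,)),
    Instr("MUL", "t1", ("t0", "c"), (1000,)),
    Instr("ADD_REDUCE", "s", ("t1",), (1000,)),
    Instr("SUB", "t2", ("a", "c"), (1000,)),
]
kernels = greedy_fuse(program)
print([[i.op for i in k] for k in kernels])
```

Under these assumptions the first two instructions end up in one kernel, while the reduction and the trailing subtraction each get their own. A real fuser would also consider data dependencies and could fuse across non-adjacent instructions.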

Code generator for array operations with parallelization and composition of multiple array operations

Caching JIT-Compiler and object storage for array operation kernels

Runtime instrumenting compilation, buffer management, array operation scheduling and execution

What

Why

Bohrium: a virtual machine approach to portable parallelism

Mads R.B. Kristensen, Simon A.F. Lund, Troels Blum, Kenneth Skovhede, Brian Vinter.

In proceedings of the Parallel & Distributed Processing Symposium Workshops (IPDPSW14)

NIELS BOHR INSTITUTE

Implementation Scope

Tools of the trade

Future / Ongoing Work

FACULTY OF SCIENCE

UNIVERSITY OF COPENHAGEN

DENMARK

Productivity

niels

numerically intensive expression language for science

Reconfigurable: BH_STACK=[cape,cluster_proxy,gpu]

Interactive environment via

Array Descriptor

Performance

Automatic Mapping of Array Operations to Specific Architectures

Array Operations

Element-wise, also known as map or zip: an operator applied over one or more arrays
Reduction
Scan
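
The three classes of array operations can be sketched with the Python standard library alone. The input lists and variable names below are illustrative assumptions; a backend would of course execute these as parallel kernels rather than as interpreter loops.

```python
from functools import reduce
from itertools import accumulate
import operator

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Element-wise over one array (map) and over two arrays (zip).
squares = [x * x for x in a]
sums = [x + y for x, y in zip(a, b)]

# Reduction: combine all elements into a single value.
total = reduce(operator.add, a)

# Scan (prefix sum): the running result of the same binary operator.
prefix = list(accumulate(a, operator.add))

print(squares, sums, total, prefix)
```

Each element-wise result is independent of the others, which is what makes these operations implicitly data-parallel; reduction and scan need an associative operator to be parallelized.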

Programming Pitfalls

Performance

Correctness

Deadlocks
Race-conditions

#directives

C / C++ / Fortran

OpenMP / pthreads / Qthread

MPI

OpenACC / LEO

PGAS

OpenCL / CUDA

for efficient hardware utilization

CPUs, APUs, Hybrid, FPGA, and clusters of them configured in shared and distributed memory systems...

Heat Equation in C and OpenMP and MPI with Latency Hiding

Heat Equation in C and OpenMP and MPI

Heat Equation in C

Heat Equation in C and OpenMP

Heat Equation in Python / NumPy

Bohrium Processing Unit

Goal: ASIC for executing Bohrium Bytecode

Simon Andreas Frimann Lund

Mads R. B. Kristensen

Brian Vinter

High flops-to-watt ratio
Low latency
FPGA prototype

Languages used by TOP15 Computational Finance / Financial Engineering / "Quant Programs"

https://www.quantnet.com/mfe-programs-rankings/

As well as the University of Copenhagen
HIPERFIT industry partners with an affinity for APL

Collaborative effort
There is even more to it

November 13, 2016

WOLFHPC 2016 in conjunction with SC16

HIPERFIT

Array Operation Fusion

Fusion Fail

C99

OpenMP

Experimental

OpenACC
LEO

Greedy

Optimal

Improve mathematical models for Finance
Express them in verifiable Domain-Specific Languages (DSLs)
Execute them efficiently on High Performance Systems

Encore

Fusion of Parallel Array Operations

Mads R.B. Kristensen, Simon A.F. Lund, Troels Blum, James Avery.

In proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT16).

CAPE: Xeon PHI

Performance

best case 2x speedup
often a slowdown
WHY!?

Allocation ~250MB/s

Transfer ~5.5GB/s
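
A back-of-the-envelope calculation suggests why these two figures matter. The sketch assumes that both costs scale linearly with buffer size, that MB and GB are decimal units, and that a 1 GB buffer is offloaded; the buffer size and the function name are illustrative assumptions, not measurements from the talk.

```python
ALLOC_RATE = 250e6     # bytes per second, from "Allocation ~250MB/s"
TRANSFER_RATE = 5.5e9  # bytes per second, from "Transfer ~5.5GB/s"

def offload_overhead(nbytes):
    """Return (allocation seconds, transfer seconds) for one buffer."""
    alloc = nbytes / ALLOC_RATE
    transfer = nbytes / TRANSFER_RATE
    return alloc, transfer

# Hypothetical 1 GB buffer.
alloc_s, transfer_s = offload_overhead(1e9)
print(f"allocation: {alloc_s:.2f} s, transfer: {transfer_s:.2f} s")
```

Under these assumptions allocation takes roughly 4 s against about 0.18 s for the transfer, so a kernel has to save several seconds of compute per gigabyte of freshly allocated device memory before offloading pays off.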

Codegen specialization

Flattening
Array contraction
Array operation composition
Array shape => Loop constructs
Checks buffer-references for aliasing
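
To make array operation composition and array contraction concrete, the sketch below computes the same expression twice: once with one loop and one full-size temporary per array operation, and once as a single composed loop in which the temporaries shrink to scalars. It is written in Python for readability; the function names and the expression sum((a + b) * c) are illustrative assumptions, not code generated by CAPE.

```python
def unfused(a, b, c):
    # One loop and one full-size temporary array per operation.
    t0 = [x + y for x, y in zip(a, b)]
    t1 = [x * y for x, y in zip(t0, c)]
    return sum(t1)

def fused_contracted(a, b, c):
    # A single loop; the temporaries t0 and t1 are contracted to scalars.
    acc = 0
    for x, y, z in zip(a, b, c):
        t0 = x + y
        t1 = t0 * z
        acc += t1
    return acc

a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
print(unfused(a, b, c), fused_contracted(a, b, c))
```

Both variants return the same value, but the composed version makes one pass over the inputs and never materializes the intermediate arrays, which is where the memory-traffic savings come from.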

Data management

device allocation
transfer to/from device
data persistence

https://github.com/safl/offload/tree/master/mic

make broken

./broken

Code generation

parallelization
specialization

SIMD utilization

Memory Management

Allocation and de-allocation of buffers backing array storage

Alignment

GPUs and Accelerators

data transfer: host <-> device
data persistence: buffer-reuse on device

Software Victim cache

Delay de-allocation
Reuse buffers
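
The idea of delaying de-allocation so that buffers can be reused might be sketched as follows. The `VictimCache` class, its exact-size matching rule, and the oldest-first eviction policy are assumptions made for illustration and are not taken from the CAPE implementation.

```python
class VictimCache:
    """Parks freed buffers so later allocations of the same size can reuse them."""

    def __init__(self, capacity=4):
        self.capacity = capacity  # maximum number of parked buffers
        self.parked = []          # freed buffers, oldest first
        self.hits = 0
        self.misses = 0

    def alloc(self, nbytes):
        # Reuse a parked buffer of exactly the requested size if one exists.
        for i, buf in enumerate(self.parked):
            if len(buf) == nbytes:
                self.hits += 1
                return self.parked.pop(i)
        self.misses += 1
        return bytearray(nbytes)

    def free(self, buf):
        # Delay de-allocation: park the buffer instead of releasing it.
        self.parked.append(buf)
        if len(self.parked) > self.capacity:
            self.parked.pop(0)  # evict the oldest buffer; it is released here

cache = VictimCache()
first = cache.alloc(1024)
cache.free(first)
second = cache.alloc(1024)
print(second is first, cache.hits, cache.misses)
```

Array programs tend to allocate many temporaries of identical shape, so even a small cache like this can avoid most allocation calls. Note that a reused buffer still holds its old contents and is not zeroed.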

Thread Management

CAPE codegen takes SIMD into consideration; however, the current implementation relies on auto-vectorization by the backend C compiler

Brittle, example:

gcc often fails where commercial compilers prevail.

A simplistic approach using #pragma omp simd [...] did not yield the expected results

Investigate further and possibly expand codegen with intrinsics or other explicit means of assuring the compiler that vectorization makes sense.

Thanks

Doubling the Performance of Python/NumPy With Less Than 100 SLOC

Simon A.F. Lund, Kenneth Skovhede, Mads R.B. Kristensen, Brian Vinter. In proceedings of the 3rd Python for High Performance and Scientific Computing (PyHPC13@SC13)

Implemented with HWLOC

Multi-core and MIC

# threads
Control core/thread affinity

Baseline: Python/NumPy

Baseline: Serial C99

python [-m bohrium] benchmark.py

CAPE-AC: Without array contraction

CAPE: WITH array contraction

https://github.com/bh107/benchpress.git rev. 0aa2942

https://github.com/bh107/bohrium.git rev. b4d3586

www.erda.dk/public/archives/YXJjaGl2ZS0xSWhQSmU=/published-archive.html

User knows his high-level array-oriented programming

Knows the configuration of the computing system

GPUs,

Accelerators,

Command-line interface via docopt

Combined: the high-level array-oriented programming model and its declarative nature provide implicit data-parallel operations and give the backend the freedom to decide how to compute them efficiently.
