Transcript

Productivity vs Performance

The Best of Both Worlds?

Chapel

Domains

in one slide

High Productivity Computing Systems (HPCS) aka

High Productivity High Performance Languages

  • Code-blocks / Expression
  • begin
  • cobegin
  • coforall
  • sync
  • serial
  • On Data
  • atomic variables
  • sync variables
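Chapel's task constructs have no direct Python equivalent. As a loose analogue, assumed here only for illustration, `begin` resembles submitting a task without waiting, `sync` resembles waiting on it, and `coforall`/`cobegin` spawn one task per iteration or statement and then join them all. The helper names below are hypothetical. Python threads share the GIL, so the sketch shows the task structure (spawn, then join), not a parallel speedup:

```python
from concurrent.futures import ThreadPoolExecutor

def coforall(iterable, body):
    """Run body(i) as one task per item and block until all finish,
    loosely mirroring Chapel's coforall (implicit join at the end)."""
    items = list(iterable)
    if not items:
        return []
    # One worker per item: each iteration becomes its own task.
    with ThreadPoolExecutor(max_workers=len(items)) as pool:
        futures = [pool.submit(body, i) for i in items]
    # Leaving the with-block waits for every task, like the end of a coforall.
    return [f.result() for f in futures]

def cobegin(*thunks):
    """Run each zero-argument callable as its own task and wait for all,
    loosely mirroring a Chapel cobegin block."""
    return coforall(thunks, lambda f: f())
```

For example, `coforall(range(4), lambda i: i * i)` returns the per-task results in iteration order once every task has finished.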

Choose one?

Python Approaches

PyMIC

Parallel Programming Languages

Shared Memory

Distributed Memory

Convenient notation

  • X10
  • Fortress
  • UPC
  • Chapel

mpi4py

pyOpenCL

First-class index-set

  • dense
  • sparse
  • strided
  • associative
  • unstructured

Handles:

  • Memory layout
  • Distribution
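A Chapel domain is a first-class index set that arrays are declared over. The Python stand-ins below are only a rough analogy, assumed for illustration: builtin ranges and sets play the role of dense, strided, sparse, and associative domains, and the hypothetical `array_over` helper declares one element per index, so iteration follows the index set rather than a hard-coded loop bound:

```python
# Loose Python stand-ins for the kinds of index sets a Chapel domain can describe.
n = 8
dense = range(1, n + 1)                 # like {1..n}
strided = range(1, n + 1, 2)            # like {1..n by 2}
sparse = {(1, 1), (3, 4), (8, 8)}       # a few occupied positions in an n x n index space
associative = {"alice", "bob"}          # indices are arbitrary keys, not integers

def array_over(domain, init=0.0):
    """Hypothetical helper: declare an 'array' over an index set,
    with one element per index of the domain."""
    return {i: init for i in domain}
```

In Chapel, the domain (through its domain map) also decides memory layout and distribution; the dictionary here ignores that and only captures the "array indexed by a domain" idea.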

pyCuda

cython

Pythran

Hardware APIs

Java "ish"

Language extensions

Parallel language constructs

threading

multiprocessing

pyChapel

DEAD

C/C++ "ish"

"Modern" feel

Copperhead

  • Loop constructs
  • Iterators
  • Parallel Iterators
  • I/O

Numba

Parakeet

Locale Abstraction

  • Hierarchical
  • Abstract unit of target architecture
  • Reasoning about locality and affinity

PGAS

  • Public vs Private determined by scoping

Express:

begin on Locales[0] do ...

begin on node.left do search(node.left);

Probing:

locale.(physicalMemory, id, name, coreCount, etc.)

ctypes

swig

Fwrap

pyChapel

function specialization

jit compilation

@decorators
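Function specialization with JIT compilation behind a decorator can be sketched as a cache keyed on argument types: the first call with a new type signature triggers a "compile", and later calls reuse it. This is a generic sketch, not the API of any of the tools listed; `specialize` and `fake_compile` are hypothetical, and `fake_compile` simply returns the Python function where a real tool would emit native code:

```python
import functools

def specialize(compile_fn):
    """Decorator factory: cache one 'compiled' variant of a function per
    tuple of argument types, compiling lazily on first call (JIT-style).
    `compile_fn(func, types)` is a hypothetical stand-in for a real compiler."""
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args):
            key = tuple(type(a) for a in args)
            if key not in cache:
                # First call with these types: build and remember a specialization.
                cache[key] = compile_fn(func, key)
            return cache[key](*args)

        wrapper.cache = cache  # exposed so callers can inspect specializations
        return wrapper
    return decorator

compiled = []

def fake_compile(func, types):
    """Pretend compiler: records the signature and returns the Python function."""
    compiled.append(types)
    return func

@specialize(fake_compile)
def add(a, b):
    return a + b
```

Calling `add(1, 2)` twice compiles once for `(int, int)`; calling `add(1.0, 2.5)` adds a second specialization for `(float, float)`.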

  • Alive and well: open source, with an active team at Cray Inc. including academic collaborations
  • Designed from scratch with heritage / lessons learned from HPF and ZPL.
  • Multi-resolution: high-level abstractions are implemented within the language itself and can be peeled off.

Global-view abstractions

  • forall loops
  • High-level: A = B + alpha * C
  • Mapped to hardware with Domain Maps using Locales
  • Default strategies user-overridable
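The global-view statement `A = B + alpha * C` can be contrasted with an explicit per-element loop. The sketch below uses NumPy as a stand-in (an assumption; NumPy neither distributes nor parallelizes this by itself): the whole-array form states *what* to compute and leaves *how* to iterate to the runtime, which is the room Chapel's domain maps use to place the work on locales:

```python
import numpy as np

def triad_loop(B, C, alpha):
    """Local-view style: explicit per-element loop over indices."""
    A = np.empty_like(B)
    for i in range(B.shape[0]):
        A[i] = B[i] + alpha * C[i]
    return A

def triad_global(B, C, alpha):
    """Global-view style: one whole-array statement; the runtime decides
    how to iterate over the elements."""
    return B + alpha * C

B = np.arange(5, dtype=np.float64)
C = np.ones(5)
```

Both functions compute the same result; only the global-view form can be retargeted without rewriting the loop.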

FFIs / wrappers

npbackend

Bohrium

NumPy / SciPy

Pythran

Figure from Code Complete by Steve McConnell

Chapel

Python Approaches - Applied

From a Python perspective...

CPython, the reference implementation of Python, makes all of the above possible

  • Rich support for Python <-> C interoperability
  • Although other approaches exist, C-interoperability is the main driver for Python performance
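As a minimal illustration of Python <-> C interoperability, the standard `ctypes` module can load the C library and call a C function directly once its signature is declared. This sketch assumes a POSIX system (Linux or macOS); `labs` is a standard C function, and `c_abs` is a hypothetical wrapper name:

```python
import ctypes
import ctypes.util

# Load the C standard library. If find_library returns None, on POSIX
# ctypes.CDLL(None) exposes symbols already loaded into the process, including libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes converts arguments and results correctly.
libc.labs.argtypes = [ctypes.c_long]
libc.labs.restype = ctypes.c_long

def c_abs(x):
    """Call the C library's labs() from Python."""
    return libc.labs(x)
```

Tools such as swig, Fwrap, or pyChapel generate or automate this kind of glue rather than having the user write it by hand.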

pyChapel

Behind the scenes

However...

Python/Chapel interoperability module

Arguments are cast or converted to equivalent types

  • Scalar arguments are passed by value
  • Array-metadata is copied
  • Array-data is passed by reference
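The passing rules above (scalars by value, array metadata copied, array data by reference) can be demonstrated with `ctypes` and NumPy. This is not pyChapel's internal marshalling, which is not shown here; `scale_in_place` is a hypothetical stand-in for a foreign kernel that receives a raw data pointer plus copied scalars and writes through the pointer, so the caller's array changes:

```python
import ctypes
import numpy as np

def scale_in_place(data_ptr, n, alpha):
    """Stand-in for a foreign kernel: receives a raw pointer to the array data
    plus scalar metadata (length, factor) by value, and writes through the pointer."""
    for i in range(n):
        data_ptr[i] *= alpha

x = np.arange(4, dtype=np.float64)
# Metadata (here just the length) is copied; the data buffer itself is shared.
ptr = x.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
scale_in_place(ptr, x.size, 2.0)
```

After the call, `x` holds the scaled values even though only a pointer and two scalars were handed over, which is why no copy of the array data is needed.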

Extern decorators

  • @Chapel ( Chapel code )
  • @FromC ( C code )
  • @FromC ( Third-party libraries )

Loss of high-level abstractions: less like Python, more like "convenient" low-level programming

Module Compiler

FFI

chapel-for-python-programmers.rtfd.org

Usage Examples

Python/NumPy + pyChapel

Python/NumPy

Python/NumPy + pyChapel

Python/NumPy

Rosen filter, a kernel function from the NumPy/SciPy software stack.

Python/NumPy

a Portable and Scalable Language for Scientific Computing

Python/NumPy + pyChapel

Python/NumPy

Chapel

  • Take some market data
  • Run some analytics / modeling on the data
  • Visualize the results

Setup simulation from dataset

Run simulation

Reduce data and analyze results

  • Multiple distinct models for expressing parallelism
  • Low-level hardware instrumentation
  • Target Specific

Finance

Science

Rosen

Observations when running the above

Not really what we wanted...

  • A constant compilation overhead of ~3 seconds per module, plus a varying amount of time depending on the compiled code.
  • The overhead is amortized, with speedups ranging from 1.8x to 15.0x.
  • Call overhead was not measurable when comparing the pyChapel and Chapel implementations of the rosen-kernel, indicating sufficiently low call overhead.
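Whether the compilation overhead pays off can be worked out from the figures above. If a NumPy run takes $t$ seconds, compilation costs $c$ seconds, and the compiled code is $s$ times faster, compiling wins when $c + t/s < t$, i.e. $t > c\,s/(s-1)$. The sketch below assumes only the constant ~3 s part of the overhead and a fixed speedup; `break_even_seconds` is a hypothetical helper:

```python
def break_even_seconds(overhead, speedup):
    """Smallest NumPy run time t for which compiling pays off, i.e. solve
    overhead + t / speedup < t  =>  t > overhead * speedup / (speedup - 1).
    Assumes a fixed compile overhead and a constant speedup."""
    if speedup <= 1:
        raise ValueError("no break-even point without a speedup > 1")
    return overhead * speedup / (speedup - 1)
```

With a 3 s overhead, the low end of the reported range (1.8x) needs a NumPy run of more than 6.75 s to break even, while 15.0x needs only about 3.2 s.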

pychapel.rtfd.org

Future Work / Current Limitations

Acknowledgements

Thanks to the Chapel Team!

Bringing it all together

an approach to Python as a High Performance Scripting Language

npbackend, how does that help?

And how do I use it?

npbackend - Usage and User Interface

Indirection Layer

Goals

  • Legacy support
  • Minimally Intrusive
  • Graceful degradation

Enable use of multiple "targets"

  • Bohrium
  • Numexpr
  • libgpuarray
  • pyChapel

... <insert your backend here>

npbackend, facilitating target optimizations

Retarget without modifying the application

  • NPBE_TARGET="gpu"

Controlled via environment variables
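An indirection layer that picks its target from an environment variable, and degrades gracefully when the target is unavailable, can be sketched as a small registry. Only the variable name `NPBE_TARGET` comes from the slides; the registry, `register_target`, `select_target`, and the fallback policy are hypothetical, and real targets such as Bohrium, Numexpr, libgpuarray, or pyChapel are not modelled:

```python
import os
import numpy as np

# Hypothetical target registry: maps a target name to an implementation of
# one array operation (element-wise add).
TARGETS = {"numpy": np.add}

def register_target(name, add_impl):
    """Make a new backend selectable by name."""
    TARGETS[name] = add_impl

def select_target(env=None):
    """Return (name, implementation) for the target requested through the
    NPBE_TARGET environment variable, falling back to plain NumPy when the
    variable is unset or names an unknown target (graceful degradation)."""
    env = os.environ if env is None else env
    requested = env.get("NPBE_TARGET", "numpy")
    if requested not in TARGETS:
        requested = "numpy"
    return requested, TARGETS[requested]
```

Because the application only ever calls through the selected implementation, switching targets is a matter of setting the variable, with no change to the application code.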

FFI - Type support

  • Expand NumPy ndarray support
  • Make type-declaration optional
  • Auto-create export declarations

Module compiler

  • Expand type-parsing and type-mapping

npbackend target

  • Implicit acceleration of Python/NumPy
  • Using FFI / decorators behind the scenes

Parallelism and Locality

  • Currently only node-level; does not leverage all the capabilities of Chapel

Exemplified

that turns

Python / Numpy

into an array language

Lazy / Deferred Evaluation

  • Construct numeric kernels
  • Loop fusion
  • Array Contraction
  • Dead code elimination
  • Common subexpressions
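Deferred evaluation with loop fusion can be sketched as an expression tree: arithmetic records operations instead of computing temporaries, and evaluation walks the tree once per index, so the whole expression runs in a single loop with no intermediate arrays. This toy `Lazy` class and `lazy_array` helper are hypothetical and illustrate only fusion; array contraction, dead code elimination, and common-subexpression elimination are not shown, and this is not how Bohrium is implemented:

```python
class Lazy:
    """Minimal deferred-evaluation node: arithmetic builds an expression tree
    instead of computing temporaries."""
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Lazy("add", self, _wrap(other))

    def __mul__(self, other):
        return Lazy("mul", self, _wrap(other))

    __radd__ = __add__  # add and mul commute, so operand order does not matter
    __rmul__ = __mul__

    def at(self, i):
        """Evaluate the whole expression for a single index i."""
        if self.op == "leaf":
            return self.args[0][i]
        if self.op == "const":
            return self.args[0]
        left, right = (a.at(i) for a in self.args)
        return left + right if self.op == "add" else left * right

    def evaluate(self, n):
        # Fused: one loop over the index space, no intermediate arrays.
        return [self.at(i) for i in range(n)]

def _wrap(x):
    return x if isinstance(x, Lazy) else Lazy("const", x)

def lazy_array(data):
    return Lazy("leaf", data)
```

For example, `lazy_array([1, 2, 3]) + lazy_array([10, 20, 30]) * 2` builds a tree; nothing is computed until `.evaluate(3)`, which produces the result in one pass.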

...

My supervisor and colleagues

Python / NumPy / SciPy for program definition, data description, and interaction

  • Straightforward sequential semantics
  • Let the user program close to their domain, not close to the hardware

Obtain performance through data-parallelism of array operations

  • Transparently map operations to Chapel via npbackend targeting pyChapel

And when the abstractions and mappings fail...

  • Provide an alternative to C ...

... an alternative with a unified model for parallel computations

  • With high-level data-parallel constructs to do most of the lifting and ...

... syntax and semantics close to Python

  • Provide Chapel as the underlying language empowering the Python user

Benchmarks

Heat Equation

  • Domain: 3000x3000
  • Iterations: 100
  • Measuring elapsed wall-clock time in seconds
  • Comparing NumPy to npbackend with different targets
  • Comparing NumPy to NumPy through npbackend to measure overhead
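The benchmark source is not shown in the slides. A common NumPy formulation of the heat equation, assumed here for illustration, is a Jacobi iteration in which every interior cell becomes the average of its four neighbours while the boundary stays fixed. Written with whole-array slices, it is exactly the kind of array code npbackend can retarget:

```python
import numpy as np

def heat_equation(grid, iterations):
    """Jacobi-style heat diffusion: each interior cell becomes the average of
    its four neighbours; boundary cells stay fixed. The right-hand side is
    computed in full before assignment, so every update uses old values."""
    g = grid.astype(np.float64, copy=True)
    for _ in range(iterations):
        interior = 0.25 * (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:])
        g[1:-1, 1:-1] = interior
    return g
```

The benchmark configuration would correspond to a 3000x3000 grid and `iterations=100`; a small grid is enough to check the update rule.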

Results

  • Numexpr x2.2
  • BohriumCPU x2.6
  • libgpuarray x5.6
  • BohriumGPU x18

Shallow Water

Intel Xeon E5640, 2.66GHz

12MB LLC

96GB DDR3, Main memory

NVIDIA GeForce GTX 460, 1GB GDDR5

With: GCC 4.8.2, OpenCL 1.2, Linux 3.13, Python 2.7, and NumPy 1.8.1.

Average of three runs, with the deviation from the mean used as verification.
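The measurement method (mean of three runs, deviation from the mean as a consistency check, speedups reported relative to NumPy) can be sketched as a small harness. The helper names are hypothetical, and reading "x2.2" as baseline time divided by target time is an assumption about how the results above were computed:

```python
import statistics
import time

def benchmark(fn, runs=3):
    """Time fn() `runs` times; return the mean wall-clock time and the largest
    absolute deviation from that mean, used to judge whether runs are consistent."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    mean = statistics.mean(times)
    return mean, max(abs(t - mean) for t in times)

def speedup(baseline_seconds, target_seconds):
    """Speedup of a target over the baseline, e.g. 2.2 for 'x2.2'."""
    return baseline_seconds / target_seconds
```

A large deviation relative to the mean would indicate noisy runs whose average should not be trusted.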

HIPERFIT: This research has been partially supported by the Danish Strategic Research Council, Program Committee for Strategic Growth Technologies, for the research center 'HIPERFIT: Functional High Performance Computing for Financial Information Technology' (hiperfit.dk) under contract number 10-092299.

This research has been partially supported by the Danish Strategic Research Council, Program Committee for Strategic Growth Technologies, for the 'High Performance High Productivity' project under contract number 09-067060.

Especially: Brad Chamberlain, Lydia Duncan, Thomas Van Doren, Elliot Ronaghan, and Ben Harshbarger.

  • Domain: 2000x2000
  • Iterations: 100

Results

  • Numexpr x2
  • BohriumCPU x2
  • libgpuarray x3.7
  • BohriumGPU x12

Kristensen, Mads R. B., Lund, Simon A. F., Blum, Troels, and Skovhede, Kenneth. "Separating NumPy API from Implementation." In Proceedings of the 4th Workshop on Python for High Performance and Scientific Computing (PyHPC14@SC14).

pyChapel

Chapel for Python Programmers

pychapel.rtfd.org

chapel-for-python-programmers.rtfd.org

Language Concerns

R

C / C++ / Fortran

MATLAB

Python

OpenMP/pthreads

MPI

pragmas

CUDA

OpenCL/CUDA/OpenACC

Modeling

Visualization

Data Analysis

Experimentation

For efficiency

npbackend

also known as the Python module

Scripting Language Performance Through Interoperability

or getting the best of multiple worlds...

Low-level interfaces

Explicit

Specialization

Implicit

Data-parallelism

GPUs,

Accelerators

Chapel

chapel.cray.org

CPUs

Static Compilation

Libraries

JIT / Dynamic Compilers

Intrusion

APUs,

Hybrid,

FPGA
