Parallel Computing Platforms

Lecture on Parallel Computing Platforms

Andreas Wicenec

on 2 October 2012

Transcript of Parallel Computing Platforms

Andreas Wicenec
Prof. of Data Intensive Research
ICRAR
Parallel Computing Platforms
Outline
Parallelism
More than one execution pipeline
Allows for more than one instruction per clock cycle
Example: Two pipelines
Two-way superscalar or dual-issue execution
Execution pipeline
Superscalar Execution
More complexity!
Improvement of execution rates
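The dual-issue idea above can be sketched with a toy model. This is only an illustration under stated assumptions: instructions are hypothetical (destination, sources) tuples, every instruction takes one cycle, issue is strictly in order, and only read-after-write dependences are checked. The name `issue_cycles` is invented for this sketch and does not come from the lecture.

```python
from typing import List, Tuple

# Hypothetical instruction format for illustration: (destination, sources).
Instr = Tuple[str, Tuple[str, ...]]


def issue_cycles(program: List[Instr], width: int) -> int:
    """Count cycles for in-order issue of up to `width` instructions per cycle.

    Assumption: an instruction cannot issue in the same cycle as an earlier
    instruction whose result it reads (read-after-write dependence).
    Other hazards are ignored in this simplified model.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()  # registers written by instructions issued this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, srcs = program[i]
            if any(s in written for s in srcs):
                break  # dependence: this instruction waits for the next cycle
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


independent = [
    ("r1", ("a", "b")),
    ("r2", ("c", "d")),
    ("r3", ("r1", "r2")),  # depends on the two instructions above
    ("r4", ("e", "f")),
]
chain = [("r1", ("a",)), ("r2", ("r1",)), ("r3", ("r2",))]

print(issue_cycles(independent, width=1))
print(issue_cycles(independent, width=2))
print(issue_cycles(chain, width=2))
```

In this model the mostly independent program needs half as many cycles with two pipelines, while the dependent chain gains nothing from the second pipeline.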
Execution stages include:
operand fetch
Overlapping of stages allows for faster execution
Alternative to superscalar execution!
Packing instructions that can be executed concurrently into a single instruction word for the processor (Example: Intel Xeon Phi with its 512-bit wide vector units)
Software based
Compiler optimized
BUT: Compiler does not have access to run-time state
Very Long Instruction Word Processors
Implicit Parallelism
Why would you even consider doing this??
Heaps of data
Loads of instructions
no time!
Multiple Motivations
Multiple (good?) Solutions
Computational Intensity
EPIC: Explicit Parallel Instruction Computing
SMT: Simultaneous Multithreading
Multicore Processors
Vector Processors
Mix of stuff...
Other (internal) Alternatives:
Paradigms:
SIMD: Single Instruction Multiple Data
MIMD: Multiple Instruction Multiple Data
SPMD: Single Program Multiple Data
MPMD: Multiple Program Multiple Data (not parallel)
SPSD: Single Program Single Data (not parallel, trivial)
Examples
ILLIAC IV 1968-1971
64(!!) bit Vector machine
200 MFLOPs (Design goal: 1 GFLOPs)
CRAY J90 module (1994)
2 Scalar + 2 Vector processors
128 words cache
4 GB memory (48 GB/s bandwidth)
CRAY XT3, 2004
MIMD, shared memory
Tianhe-1A 2010
heterogeneous, CPU+GPU
MIMD + SIMD
2.5 PFLOPs
K computer 2011
10 PFLOPs
Source: Top 500 list
Trends
External Parallelism
Goals:
minimize number of connections
small width
scalability
Interconnect Topologies
Simple paradigm: Take one building block and replicate!
Building blocks can be cores, CPUs, SMP cards, complete computers or complete supercomputers, using LAN or WAN (cloud) interconnects.
Requires an interconnect between CPUs and memory, and between CPUs or computers.
The topology of the interconnect limits performance and can easily become the bottleneck.
Additional constraint from storage access.
Topologies
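The trade-off between few connections and short paths can be made concrete with a small sketch. It compares three textbook topologies (ring, hypercube, fully connected), which are chosen here for illustration and are not the ones named on the slides. The two measures used, link count and diameter (the longest shortest path in hops), are assumptions of this sketch; the slides do not define their metrics.

```python
from collections import deque


def ring(p):
    """Each of the p nodes is linked to its two neighbours."""
    return {i: {(i - 1) % p, (i + 1) % p} for i in range(p)}


def hypercube(dim):
    """2**dim nodes; two nodes are linked if their labels differ in one bit."""
    return {i: {i ^ (1 << b) for b in range(dim)} for i in range(1 << dim)}


def fully_connected(p):
    """Every node is linked to every other node."""
    return {i: set(range(p)) - {i} for i in range(p)}


def link_count(graph):
    """Number of undirected links (each link is stored at both ends)."""
    return sum(len(neighbours) for neighbours in graph.values()) // 2


def diameter(graph):
    """Longest shortest path in hops, via breadth-first search from every node."""
    longest = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(dist.values()))
    return longest


for name, graph in [
    ("ring", ring(16)),
    ("hypercube", hypercube(4)),
    ("fully connected", fully_connected(16)),
]:
    print(f"{name:16s} links={link_count(graph):4d} diameter={diameter(graph)}")
```

For 16 nodes the ring needs only 16 links but has a diameter of 8, the fully connected network reaches every node in one hop at the cost of 120 links, and the hypercube sits in between with 32 links and a diameter of 4.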
Kautz
Tree
SMP - Symmetric Multi-Processing (<= 64 CPUs)
All processors access common memory with equal rights
Used in desktops and servers
NUMA - Non-Uniform Memory Access
Global address space (as in SMP)
Faster to local memory
Slower remote
Distributed memory multicomputers
Communication via messages
Vector computers
Multiple functional units performing the same operation on vector registers (very long ones) e.g. vector addition, dot product
Almost disappeared
Architectures in Detail
Bit-level parallelism: Working on more than one bit (bytes, words, long-words...)
Instruction level parallelism: More than one instruction executed within one clock cycle
Data parallelism: Do the same for a lot of data (SIMD)
Task parallelism: Run the same task many times (Cloud)
In other words
Multi-core
Symmetric multi-processor (SMP)
Massively Parallel (MPP)
Grid or cloud computing
External Parallelism 'Hierarchy'
EPIC - Murdoch University: Classic cluster
Pawsey 2 - MPP, proprietary interconnect, CRAY + Accelerator
Parallel Platforms in Perth
EPIC
9600 Xeon cores
800 HP servers
Infiniband interconnect
600 TB global storage
FORNAX:
1152 Xeon cores, 96 SGI servers
96 Nvidia C2075 GPUs
500 TB global + 650 TB local storage
Dual Infiniband interconnect
Cray Cascade platform (under development)
proprietary interconnect with flexible topology
Two stages: first > 400 TFLOPs, then > 1 PFLOPs
Accelerators - Intel Xeon Phi and/or GPUs
Disk storage + HSM tape storage
will be the fastest computer in Australia and probably the southern hemisphere.
Pawsey 2 (next year)
Memory is very often the bottleneck
Performance captured by:
But: Fast memory is very expensive
Work-around: Hierarchy of memory systems
Multiple cache levels (often 2)
Cache usually small (few MB maximum)
Cache is part of the CPU die
Increased complexity - Need to consider
Temporal locality and
Spatial locality
Memory Games
Why parallel
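Why temporal and spatial locality matter for a small cache can be shown with a toy simulation. This is a sketch under stated assumptions: a fully associative cache with least-recently-used replacement, 4 lines of 8 words each, and a 16 x 16 array stored row by row. The sizes and the name `count_misses` are invented for illustration and do not describe any real CPU.

```python
from collections import OrderedDict


def count_misses(addresses, line_words=8, cache_lines=4):
    """Count misses of a tiny fully associative LRU cache (toy model)."""
    cache = OrderedDict()  # cached line numbers, least recently used first
    misses = 0
    for addr in addresses:
        line = addr // line_words  # neighbouring words share a cache line
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


ROWS, COLS = 16, 16

# The array is assumed to be stored row by row (row-major order).
row_order = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
col_order = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

# Temporal locality: the same few words are reused again and again.
hot_loop = list(range(8)) * 10

print("row by row:      ", count_misses(row_order))
print("column by column:", count_misses(col_order))
print("hot loop:        ", count_misses(hot_loop))
```

Walking the array row by row uses every word of a fetched cache line (spatial locality), whereas walking it column by column evicts each line before its neighbouring words are needed, so every access misses in this model.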
Implicit Parallelism
External Parallelism
Memory Games
What we have in Perth
We used to have parallel printer cables and even SCSI cables
They have not been used for a number of years
Technical issues:
Cross-talk between wires
Limited length
not scalable much beyond 32 wires
more wires means shorter cables!
Example: Cables