Analysis of Interconnect Performance

by Nitin Joshi
on 13 May 2014

Transcript of Analysis of Interconnect Performance

Analysis of Interconnect Performance
with Partial-Match Compression for
Multi-core Systems

Presented by:
Nitin Joshi
Graduate Student
Department of Computer Science, Lamar University

A Thesis Presented to:
Dr. Jiangjiang Liu
Supervisor, Associate Professor
Department of Computer Science, Lamar University

Dr. Stefan Andrei
Associate Professor and Department Chair
Lamar University

Dr. Lawrence J. Osborne
Professor
Lamar University

Contents for today's talk:
Introduction and Motivation
Related works
Methodology and implementation details
Simulation Setup
Results and Analysis
Conclusion
Questions
The Problem
As personal computers have become more prevalent and more applications have been designed for them, end users have needed faster, more capable systems to keep up.

As of 2003, performance improvement for single-core processors had come to a standstill.

Speedup has been achieved by increasing clock speeds and, more recently, by adding multiple processing cores to the same chip.

However, increasing the number of cores on a single chip raises challenges with memory and cache coherence, as well as communication between the cores.
Introduction
Uni-core Systems vs. Multi-core Systems

Multi-core system design:
• N number of cores
• Level one instruction cache (L1-I) for each core
  • Level one data cache (L1-D) for each core
• Level two cache (L2)
• Level three cache (L3)
• Main Memory

Uni-core System Design
• Single core processor
• Level one instruction cache (L1-I)
  • Level one data cache (L1-D)
• Level two cache (L2)
• Level three cache (L3)
• Main Memory

How a shared cache memory system works in a multi-core system:
1) A core sends a request for an instruction/data address to the L1-I or L1-D cache.
2) If there is a hit in the L1-I/D cache, the corresponding instruction/data is sent back to the core. If there is a miss, the address is sent to the L2 cache.
3) If there is a hit in the L2 cache, the corresponding instruction/data is sent back to the L1-I/D cache. If there is a miss, the address is sent to the L3 cache.
4) If there is a hit in the L3 cache, the corresponding instruction/data is sent back to the L2 cache. If there is a miss, the address is sent to main memory, which returns the corresponding instruction/data to the L3 cache.
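The lookup flow above can be summarized in a short sketch. This is an illustrative model only (the Cache class, its dictionary-based storage, and the lookup helper are simplifications invented for this example, not M5's actual design):

    # Illustrative model of the multi-level lookup flow described above.
    class Cache:
        def __init__(self, name, next_level=None):
            self.name = name
            self.lines = {}              # address -> instruction/data
            self.next_level = next_level

        def lookup(self, address):
            if address in self.lines:    # hit: return to the requester
                return self.lines[address]
            if self.next_level:          # miss: forward to the next level
                data = self.next_level.lookup(address)
            else:                        # last level missed: go to main memory
                data = MAIN_MEMORY[address]
            self.lines[address] = data   # fill this level on the way back
            return data

    MAIN_MEMORY = {0x1000: "instr"}      # stand-in for main memory
    l3 = Cache("L3")
    l2 = Cache("L2", next_level=l3)
    l1_i = Cache("L1-I", next_level=l2)
    print(l1_i.lookup(0x1000))           # misses at every level, then fills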
The Performance Bottleneck
Interconnect Issues...
The Performance Bottleneck (cont.)
What is an Interconnect?
In layman's terms, interconnects are just like our roads. Their sole purpose is to provide a medium for the transportation of objects (data/instructions, in our case). Just as cars and other vehicles use roads to commute from city A to city B, bits are the commuters of interconnects.
The Performance Bottleneck (cont.)
The Problem...
A funny fact: the problems faced by designers of new-generation processor systems are no different from those faced by designers of new-generation civil transportation systems. They face exactly the same problem: the traffic bottleneck.
Uhhh... too much traffic!
Just as our roads get clogged up by increasing traffic and an ever-growing population, the interconnects in multi-core systems are getting clogged up by the large amounts of traffic produced by the increasing number of cores per processor. Hence the problem: the "performance bottleneck".
Our Solution
We reduce the amount of traffic flowing through the interconnect by compressing it before launching it onto the interconnect.
The Partial Match Compression Technique
Related Works
Interestingly, interconnect compression has been explored not only for improving performance, but also for reducing power consumption and cost.
Kant and Iyer from Intel:
They studied the effectiveness of interconnect compression (address and data) for high-performance servers. Their findings show that the proposed compression scheme has the potential to reduce interconnect width while maintaining equal or better performance.
Related works (cont.)
Alameldeen and Wood:
They demonstrated that in chip multiprocessors (CMPs), both cache and off-chip interconnect compression result in performance improvements:
- up to 18% performance improvement for an 8-processor system, and
- up to 41% more off-chip communication bandwidth.
Jin, Yum and Kim:
They proposed an adaptive data compression technique to reduce on-chip network latency and power consumption:
- up to 44% packet latency improvement, and
- an average of 36% power consumption reduction for a 16-core CMP.
Related works (cont.)
Thuresson and Stenstrom:
They used a value-cache-based compression scheme (called Data Link Compression) for multiprocessor systems:
- they were able to reduce the compression overheads introduced by this compression technique, and
- found that it is possible to remove most of the value-cache-generated traffic, but at the cost of less efficient compression.
Methodology and Simulation setup
In order to analyze interconnect performance with the Partial Match compression scheme, we simulated a full-scale multi-core processor system.
The simulator we chose is the M5 simulator.
A wide range of benchmarks (both integer and floating point) from the SPEC CPU2006 benchmark suite was used.
The M5 Simulator
What is a simulator?
By simulation we mean the imitation of a real thing or process; a simulator is a piece of software that models computer devices (or components) to predict outputs and performance metrics for a given input.

There are two types of simulators:
Microprocessor (instruction set) simulators, and
Full-system simulators.
The M5 Simulator (cont.)
Why use a simulator?
In computer systems design, simulation of an architecture is often required prior to implementation. While small-scale mockups of the desired architecture can be helpful, they are often expensive and time-consuming.
So what is M5?
The M5 simulator is an object-oriented (CPUs, buses, caches, etc. are the objects), event-driven, full-system simulator.
M5 is a modular platform for computer system architecture research, encompassing system-level architecture as well as processor microarchitecture.
The M5 Simulator (cont.)
The platforms on which M5 runs are Intel x86-compatible systems running Linux, OpenBSD, or Cygwin; other Unix-like systems; and 64-bit machines.
The tools used in the build and simulation process are gcc/g++ 3.0+, Python 2.4+, and SCons 0.95 or 0.96.1.
Anyway, why is it called M5? Strange name!
I am not sure why it is called M5, but probably because this simulator was designed by 5 researchers from the University of Michigan. So you see, the M comes from Michigan, and the 5 because there were 5 people responsible for it. Neat!

The Target System Architecture Details
2-Core Configuration: 90 nm

Core clock rate: 2 GHz
L1 I-cache/L1 D-cache [32 KB, 2-way set assoc., 16 B block size, 2-cycle latency],

L2 cache [1 MB, 8-way set assoc., 64 B block size, 4-cycle latency],

L3 cache [2 MB, 16-way set assoc., 64 B block size, 8-cycle latency],

L1-L2 interconnect [8 B data/instr. lines, 2-cycle],

L2-L3 interconnect [8 B data/instr. lines, 2-cycle],

L3-M interconnect [16 B data/instr. lines, 4-cycle]

The Target System Architecture Details (cont.)
4-Core Configuration: 65 nm

Core clock rate: 2 GHz
L1 I-cache/L1 D-cache [32 KB, 2-way set assoc., 16 B block size, 2-cycle latency],

L2 cache [2 MB, 8-way set assoc., 64 B block size, 4-cycle latency],

L3 cache [4 MB, 16-way set assoc., 64 B block size, 8-cycle latency],

L1-L2 interconnect [8 B data/instr. lines, 2-cycle],

L2-L3 interconnect [8 B data/instr. lines, 2-cycle],

L3-M interconnect [16 B data/instr. lines, 4-cycle]

The Target System Architecture Details (cont.)
8-Core Configuration: 45 nm

Core clock rate: 2 GHz
L1 I-cache/L1 D-cache [32 KB, 2-way set assoc., 16 B block size, 2-cycle latency],

L2 cache [4 MB, 8-way set assoc., 64 B block size, 4-cycle latency],

L3 cache [8 MB, 16-way set assoc., 64 B block size, 8-cycle latency],

L1-L2 interconnect [8 B data/instr. lines, 2-cycle],

L2-L3 interconnect [8 B data/instr. lines, 2-cycle],

L3-M interconnect [16 B data/instr. lines, 4-cycle]
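The three configurations above differ only in L2/L3 capacity, so they can be generated from one parameterized description. The sketch below is a hypothetical helper written purely for exposition; it is not M5 configuration syntax:

    # Hypothetical description of the three target systems (not M5 syntax).
    def target_system(cores):
        # L2 and L3 capacity scale with the core count; every other
        # parameter is shared by the 2-, 4-, and 8-core configurations.
        return {
            "cores": cores,
            "clock": "2 GHz",
            "l1": {"size_kb": 32, "assoc": 2, "block_b": 16, "latency": 2},
            "l2": {"size_mb": cores // 2, "assoc": 8, "block_b": 64, "latency": 4},
            "l3": {"size_mb": cores, "assoc": 16, "block_b": 64, "latency": 8},
            "interconnects": {
                "l1_l2": {"width_b": 8, "cycles": 2},
                "l2_l3": {"width_b": 8, "cycles": 2},
                "l3_mem": {"width_b": 16, "cycles": 4},
            },
        }

    for n in (2, 4, 8):
        print(target_system(n))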

A brief description of SPEC CPU2006
CPU benchmarks
CPUs are commonly evaluated with the processor performance benchmarks listed in the SPEC CPU2006 benchmark suite.
In each simulation, each thread in each core is assigned a benchmark from SPEC CPU2006 (SPEC 2012).
Each simulation is expected to run 10M instructions.
A brief description of SPEC CPU2006
CPU benchmarks (cont.)
Example INT benchmarks:

Benchmark: sjeng
Language: C
Application area: Artificial intelligence (chess)
Description: A highly ranked chess program that also plays several chess variants.

Benchmark: gcc
Language: C
Application area: C compiler
Description: Based on gcc version 3.2; generates code for Opteron.
A brief description of SPEC CPU2006
CPU benchmarks (cont.)
Example FLOAT benchmarks:

Benchmark: soplex
Language: C++
Application area: Linear programming, optimization
Description: Solves a linear program using a simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models.

Benchmark: cactusADM
Language: C, Fortran
Application area: Physics / general relativity
Description: Solves the Einstein evolution equations using a staggered-leapfrog numerical method.
The Partial Match Compression Scheme
General Bus compression scheme
The higher-order portion of an address is compressed by the compressor which is implemented as a small compression cache.
The lower-order portion, which is not very compressible due to its highly-varying nature, is transmitted as is on the compressed bus.
The width of the compressed bus is equal to the width of the compressed address.
At the receiving end, the original address is retrieved by looking up a register file present in the decompressor hardware.
To compress an address, a set of bits from the incoming address, called the index (I field), is used to search the compression cache.
If the compression cache is n-way set-associative, then one of the n tags stored in the set indexed by the I field can potentially fully match the tag from the incoming address and provide the way bits (W field).
In the case of a hit, the I and W fields, together with the hit/miss control bit (C field) and the uncompressed lower-order portion (U field), form a compressed address with a combined bit width less than the original.
In the case of a miss in the compression cache at the sending end, the tag corresponding to the least recently used (LRU) entry in the set indexed by the I field is replaced by the tag of the new address.
A miss in the compression cache incurs a transmission penalty, because extra cycles are needed to transfer the control field as well as the entire address.
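A minimal sketch of this scheme, assuming illustrative field widths (4-bit I, 2-bit W, 8-bit U) and a simple list-based LRU; real hardware would compare all ways in parallel:

    # Minimal sketch of the general bus compression scheme above.
    INDEX_BITS, WAY_BITS, LOW_BITS = 4, 2, 8
    WAYS = 1 << WAY_BITS                       # 4-way set-associative

    # one LRU-ordered list of tags per set (front = most recently used)
    sets = [[] for _ in range(1 << INDEX_BITS)]

    def compress(addr):
        low = addr & ((1 << LOW_BITS) - 1)                   # U field, sent as-is
        index = (addr >> LOW_BITS) & ((1 << INDEX_BITS) - 1) # I field
        tag = addr >> (LOW_BITS + INDEX_BITS)                # higher-order portion
        ways = sets[index]
        if tag in ways:                        # hit: C=1; send I, W, and U
            way = ways.index(tag)              # list position stands in for W
            ways.insert(0, ways.pop(way))      # refresh LRU order
            return ("hit", index, way, low)    # 1 + 4 + 2 + 8 = 15 bits total
        ways.insert(0, tag)                    # miss: install the new tag
        if len(ways) > WAYS:
            ways.pop()                         # evict the LRU entry
        return ("miss", addr)                  # C=0 plus the entire address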
The Partial Match design paradigm
and block partition we chose
In this section, we describe the Partial Match (PM) compression cache.
PM follows the general bus compression scheme, with some improvements.
As the name suggests, this cache allows partial matches to register as compression hits.
Thus, we check for the longest match between the tag portion stored in the compression cache and the tag portion of the incoming address.
Partial Match logic for k = 3
Since there is more redundancy in the higher-order bits of the address, we consider k possible groups of bits (shown in the figure as part 0, part 1, and part 2 for k = 3), each ending at the most significant bit (MSB) of the incoming address, as candidates for a partial match.
The Partial Match design paradigm
and block partition we chose (cont.)

There are three main hit/miss cases in PM:
Complete hit
Partial hit
Complete miss
These are explained below.
The Partial Match design paradigm
and block partition we chose (cont.)

• Complete hit: If the tag matches fully, PM performs exactly the same as the normal bus compression scheme. The control bit CH is set high to indicate a complete hit; otherwise, it is set low.
• Partial hit: If a partial match occurs in any of the k groups, the control pattern corresponding to the longest match is transmitted to indicate a partial hit.
For example, when k = 3, the control pattern C1C0 will be 00, 01, or 10 for a part 0 hit, part 1 hit, or part 2 hit, respectively. The remaining portion of the higher-order part of the address (the part that did not match the tag), Tmiss, is sent in uncompressed form, as is the lower-order portion of the address.
In this case, fewer bits are transferred over the compressed bus, as shown below.
• Complete miss: In the case of a complete miss in the PM compression cache, when none of the partial matches succeed, the entire address is sent along with the control bits CH = 0 and C1C0 = 11.
Even though more bits must be transferred in PM than in the normal compression scheme (due to the extra control bits for indicating a complete miss), performance is not affected much, because complete misses are far less frequent in PM than in the case of optimized BE.
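The hit/miss cases can be sketched as follows. The tag width (24 bits) and the group widths (top 20, 16, and 8 bits; part 2 is the most significant 8 bits, and every group ends at the MSB) are illustrative assumptions, not the thesis parameters:

    # Sketch of the PM hit/miss cases for k = 3 (illustrative widths).
    def pm_compress(stored_tag, incoming_tag, tag_bits=24):
        if stored_tag == incoming_tag:
            # Complete hit: CH = 1; behaves like the normal scheme.
            return {"CH": 1, "tmiss_bits": 0}
        # Longest group first: part 0 (top 20 bits), part 1 (top 16),
        # part 2 (top 8); C1C0 encodes which group matched.
        for c1c0, group_bits in (("00", 20), ("01", 16), ("10", 8)):
            shift = tag_bits - group_bits
            if stored_tag >> shift == incoming_tag >> shift:
                # Partial hit: send C1C0 plus the unmatched remainder
                # Tmiss (shift bits), uncompressed.
                tmiss = incoming_tag & ((1 << shift) - 1)
                return {"CH": 0, "C1C0": c1c0, "Tmiss": tmiss,
                        "tmiss_bits": shift}
        # Complete miss: CH = 0, C1C0 = 11, and the entire address is sent.
        return {"CH": 0, "C1C0": "11", "tmiss_bits": tag_bits}

    # Example: the tags agree in their top 16 bits but not the top 20,
    # so this registers as a part 1 hit with an 8-bit Tmiss.
    print(pm_compress(0xABCDEF, 0xABCD12))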
Simulation Results and
Analysis
The effectiveness of any compression scheme can be evaluated in a number of different ways. We could measure:
- the relative complexity of the algorithm,
- the memory required to implement the algorithm,
- how fast the algorithm performs on a given machine,
- how closely the reconstruction resembles the original data, and
- the amount of compression achieved.
In this study, we are mainly concerned with the last criterion.
Compression Ratio
What is a compression ratio? Let us take a brief look.
A very logical way of measuring how well a compression algorithm compresses a given set of data is to look at the ratio of the number of bits after compression to the number of bits before compression:

    compression ratio = (# bits after compression) / (# bits before compression)

This ratio is called the compression ratio.
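As a quick worked example (the numbers are illustrative, not results from this study): if a 32-bit block is transmitted using 24 bits after compression, the compression ratio is 24/32 = 0.75; a ratio of 0.20 would mean that the compressed traffic is only one-fifth of the original.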
System Architectures we investigated
In our study, we considered architectures with
2-core with 1-thread per core
2-core with 2-thread per core
4-core with 1-thread per core
4-core with 2-thread per core
8-core with 1-thread per core
The block of bits to be compressed is a 32-bit block that contains both information and data (ID).
The compression analysis is done for the traffic flow at the interconnect connecting L3 cache and the main memory.
Lower compression ratio means better compression.
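As a rough illustration, the compression ratio over a trace of such blocks might be measured as below; bits_sent is a hypothetical helper that returns the number of bits actually transmitted for one block (control bits plus payload):

    # Illustrative measurement of the compression ratio over a trace of
    # 32-bit ID blocks; bits_sent(block) is a hypothetical helper.
    def compression_ratio(trace, bits_sent):
        transmitted = sum(bits_sent(block) for block in trace)
        original = 32 * len(trace)     # each uncompressed ID block is 32 bits
        return transmitted / original  # lower ratio means better compression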
Simulation Results
Every target system mentioned in the previous slide was simulated to obtain two sets of results:
1) Compression ratio for fully associative and 4-way set associative compression cache designs.
2) Compression ratio for ID blocks with one byte of uncompressible unit.
Compression Ratio for all systems
Analysis
Due to time constraints, in this subsection we consider the analysis of only one system:
2-core with 1-thread per core
Analysis (cont.)
2-core with 1-thread per core system
In order to investigate the performance of our compression cache, we simulated the target systems with different compression cache sizes.
Lower compression ratio means better compression.
Analysis for a fully associative cache
Here we concentrate on analyzing the system with a fully associative cache.
The figure shows that the cache configuration with 16 elements has the best compression ratio compared to the other tested configurations.
(Configurations are labeled [#elements in cache][block size][tag size].)
Analysis for a fully associative cache (cont.)
Now the important question: why do we see this trend?
To understand this trend, we need to dig deeper.
The next step is to find out how many complete hits, partial hits, and complete misses we had in each case.
We now focus our attention on the following figure.
Analysis for a fully associative cache (cont.)
[Figure: complete hits, partial hits, and complete misses vs. number of elements in the cache]
From this figure, we deduced that three dominant factors decide the compression ratio in this case. These factors are:
Complete Hit rate
Partition[2] hit rate (most significant 8 bits)
Complete Miss rate
Conclusion
We now know that the complete hit rate and the partition[2] hit rate are the dominant factors in deciding the compression ratio.
Analysis for a fully associative cache (cont.)
We also know that the total number of bits required to represent the compressed address (the index and way fields) increases as the cache size increases.
This is why we see the compression ratio increase (worsen) as the cache size grows, making a cache with a small number of elements more productive.
Conclusion
In this work, we studied the performance of Partial Match, which aims at compressing both information and data at the interconnects throughout the memory hierarchy.
We find that our Partial Match (PM) compression technique is in fact very effective in reducing the overall interconnect traffic for any multi-core architecture design.
We also conclude that, by partitioning the block of bits for information and data, we increase the probability of complete hits and reduce the overall bit flow through the interconnect components.
Conclusion (cont.)
We also find that our PM compression technique is not dependent upon the compression cache type; however, the compression ratio depends directly on the compression cache size we choose.
From the analysis of the compression ratio for the different core counts, we observed that even if we increase the number of cores from two to eight, the compression ratio achieved by our PM technique does not vary much (up to 0.20 for the 2-core, 2-thread-per-core architecture and up to 0.22 for the 8-core, 1-thread-per-core architecture).
Questions
Thank you