Transcript of Improving efficiency of data intensive applications on GPU using lightweight compression

Introduction
Improving efficiency of data intensive applications on GPU using lightweight compression
Piotr Przymus
Nicolaus Copernicus University, Chopina 12/18, Torun, Poland

Krzysztof Kaczmarski
Warsaw University of Technology, Plac Politechniki 1, 00-661 Warszawa, Poland
Target: improve the efficiency of some data-intensive applications on the GPU.
Solution: improve the efficiency of data transfer from disk, through RAM and global GPU memory, to the processing unit.
Contribution of this work
we have developed two highly optimized GPU parallel (de)compression methods
we evaluated the proposed algorithms in a data-intensive experiment involving time series and pattern matching
General-Purpose computing on Graphics Processing Units
Conclusions and Future Work
Method Description
Lightweight Compression
PFOR and PFOR-DIFF
Lightweight Compression on GPU
Experiment description
Results
Motivation
Data-intensive applications characterization
Research hypothesis
Benefits of lightweight compression on GPU
GPGPU
the data itself, its size, complexity or rate of acquisition is the biggest problem;
they require fast data access and minimization of data movement;
they expect high, preferably linear, scalability of both the hardware and the software platform.
Data-intensive applications may increase their efficiency by utilization of lightweight compression methods in GPU global and shared memory.
Cost of data decompression can be amortised by fewer global memory reads.
Additional benefit may be obtained by proper utilisation of shared memory decompression.
With this method, data-intensive applications may benefit from GPU computing for smaller data sets than they would without it.
shorter time to load data from disk
shorter time to copy data from RAM to device (GPU)
possibility to fit a larger data set directly on the GPU
reduced data size in storage
ultra fast compression and decompression
GPU programming offers tremendous processing power and excellent scalability with an increasing number of parallel threads
the dominant frameworks for GPU programming are OpenCL and Nvidia CUDA
Some GPU limitations:
obligatory data transfer between RAM and the GPU, which adds to the cost of computation
a wrong memory access pattern can drastically reduce performance
limitations of the various memory types
Figure. Achieved compression level of lightweight compression and conventional
compression methods for various datasets (higher is better).
Data sets: TR1, TR2 - star light curve; IN1, IN2, TS1 - CFD and stock quotations;
TS2 - radioactivity in the ground; PH1, PH2 - ECG data.
Data settings
Experiment settings
Experiments
In our experiments we used the following equipment: two Nvidia cards, a Tesla C2050/C2070 with 2687 MB and a GeForce 2800 GTX with 896 MB (CUDA Compute Capability 2.0 and 1.3 respectively); two Six-Core AMD Opteron processors with 31 GB RAM; and an Intel RAID Controller RS2BL040 configured as RAID 5 with four Seagate Constellation ES ST2000NM0011 2000 GB drives. We used Linux with kernel 2.6.38-11 and CUDA driver version 4.0.

Experiments were conducted on different sizes of data (2 MB, 4 MB, 10 MB, 50 MB, 100 MB, 250 MB, 500 MB, 750 MB, 1 GB). For each size, 10 iterations were performed and the results were averaged. In addition, to make disk read times comparable, they were averaged for each data size.
To evaluate our solution we prepared the following experiments:
1. GPU algorithm without decompression compared to CPU algorithm without decompression.
2. GPU algorithm with decompression in global memory compared to CPU algorithm with decompression.
3. GPU algorithm with decompression in shared memory compared to the GPU algorithm with decompression in global memory and the CPU algorithm with decompression.
We plan to design a general-purpose library which could be used in a similar way to the CUDA Thrust library. Utilization of a common iterator pattern with overlapping memory blocks may offer interesting possibilities, breaking the barrier of inter-block thread communication.
We plan to redesign the PFOR and PFOR-DIFF compression schemes to better fit the GPU architecture and eliminate the limitations indicated in this study.
Future Work
References
Thank you!
As an initial setup for this research we use time series matching problems as an exemplar of data-intensive applications. As a distance function we can use the Hamming distance. Due to the lack of complex calculations, the main limitation of this problem is data transfer.
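A minimal CUDA sketch of such a matching step, assuming the (already decompressed) series and the query pattern are integer arrays resident in GPU global memory; the kernel name and launch configuration are illustrative, not the paper's implementation:

    // Each thread compares the query pattern against one window of the series
    // and writes the Hamming distance (number of mismatching positions).
    __global__ void hammingMatch(const int *series, int seriesLen,
                                 const int *pattern, int patternLen,
                                 int *distances)
    {
        int start = blockIdx.x * blockDim.x + threadIdx.x;
        if (start + patternLen > seriesLen) return;
        int dist = 0;
        for (int i = 0; i < patternLen; ++i)
            dist += (series[start + i] != pattern[i]);
        distances[start] = dist;
    }
    // Launch with one thread per window, e.g.:
    //   hammingMatch<<<(numWindows + 255) / 256, 256>>>(d_series, n, d_pattern, m, d_dist);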
Figure. Performance of a whole pattern matching algorithm with and without
compression for 1GB of data, excluding IO times (lower is better).
g - global memory decompression, s - shared memory decompression.
Performance improvements when using lightweight compression (higher is better)
(a) Speed-up on IO read time
(b) Computation speed-up when using lightweight decompression, compared to a single-threaded CPU version without decompression. This includes IO.
Compression and decompression routine selection
PFOR-DIFF reconstruction step
Shared memory vs global memory
This will allow us to estimate the possible general speed-up of an algorithm when run on a GPU, and will also be a starting point for other experiments by registering the time necessary to perform pure computations without any decompression overhead.
This will show the potential speed-up when data transfer is minimised by data compression.
This will show the speed-up when minimising GPU global memory reads.
Classical compression algorithms
computationally intensive
created to minimize the size of data
limited gain from GPU power
The PFOR-DIFF reconstruction step involves iterating over the decompressed buffer; a naive implementation will bring performance down.
Solution:
use a properly implemented parallel prefix sum (see the sketch below)
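For illustration, the reconstruction maps directly onto a scan primitive; a minimal sketch using the CUDA Thrust library (variable names are ours, not the paper's):

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // d_frame holds the decoded differences of one frame, with the first
    // element being the initial value; an inclusive prefix sum restores
    // the original values in place.
    void reconstructFrame(thrust::device_vector<int> &d_frame)
    {
        thrust::inclusive_scan(d_frame.begin(), d_frame.end(), d_frame.begin());
    }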
decompression threads do not form coalesced reads, leading to a drop in performance
current solution:
pack the data in texture memory
texture limitation: a maximum number of elements in the case of a one-dimensional array
future work solution: creating a new algorithm specialized for the GPU
the CUDA architecture does not cope well with memory reads of types less than 32 bits in length
solution: cast the compressed data (a char array) to an array of integers (decompression requires the reverse conversion; see the sketch below)
Coalesced reads and reading types less than 32 bits in length
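A sketch of the casting idea, assuming the compressed buffer is padded to a multiple of 4 bytes; the kernel reads the byte stream through a 32-bit pointer so the loads are full-width and coalesced, then unpacks the bytes in registers (names are illustrative):

    __global__ void readCompressedWords(const unsigned char *compressed,
                                        int numWords, unsigned char *out)
    {
        // Reinterpret the char array as 32-bit words (cudaMalloc guarantees alignment).
        const unsigned int *words = reinterpret_cast<const unsigned int *>(compressed);
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numWords) return;

        unsigned int w = words[i];            // a single 32-bit, coalesced read
        out[4 * i + 0] =  w        & 0xFF;    // the reverse conversion: back to bytes
        out[4 * i + 1] = (w >>  8) & 0xFF;
        out[4 * i + 2] = (w >> 16) & 0xFF;
        out[4 * i + 3] = (w >> 24) & 0xFF;
    }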

We prepared two versions of the algorithms for the GPU:
using shared memory
shared memory is hundreds of times faster than global
not all solutions allow the use of shared memory
using global memory
still a good alternative for other solutions
To use decompression (see the sketch below):
put the decompression macro at the beginning of your kernel
declare an array of 24 integers (safely located in the thread's registers)
after the decompression phase the array can be safely reused for other purposes
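A hedged illustration of this usage pattern; DECOMPRESS_FRAME is a placeholder for the authors' macro (their real macros are generated per bit length), so only the shape of the kernel is meant to be accurate:

    // Placeholder for the generated decompression macro: here it simply copies
    // 24 integers per thread, standing in for the real bit-unpacking code.
    #define DECOMPRESS_FRAME(src, tid, dst)                     \
        for (int _i = 0; _i < 24; ++_i)                         \
            (dst)[_i] = ((const int *)(src))[(tid) * 24 + _i];

    __global__ void processCompressed(const unsigned char *compressed, int *result)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Decompression macro placed at the beginning of the kernel;
        // the 24-integer array fits safely in the thread's registers.
        int decoded[24];
        DECOMPRESS_FRAME(compressed, tid, decoded);

        // After the decompression phase the array can be reused for other work.
        int acc = 0;
        for (int i = 0; i < 24; ++i)
            acc += decoded[i];
        result[tid] = acc;
    }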
Other details
Example: {1,2,3,3,2,2,2,3,3,1,1,64,2,3,1,1}
we could use 2 bits to encode the frame if not for the value 64
because of it we have to use 6-bit encoding, thus wasting 4 bits for each element
Outliers problem
Solutions for outliers problem
Most notable:
PFOR by Zukowski et al. in 2006
Adaptive FOR by Delbru et al. in 2010
New PFOR by Yan et al. in 2009
PFOR
A compression block consists of three parts:
compressed data
exception offsets
remainders of exceptions

How compression works:
determine the bit encoding length for the frame (considering the size of all three parts)
compress using FOR
when an outlier is processed:
preserve its position in the offset array
split its value into the low bits, which are stored in the first section, and the remainder, which is retained in the third
compress the second and third sections (separately) using FOR

How decompression works:
decompress the data, offset and exception sections
iterate over the offset array and restore each outlier by applying the patch from the exception array (see the sketch below)
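A simplified sketch of the patching step during decompression (the layout and parameter names are illustrative): each exception's low bits were already decoded by FOR in the data section, so the remainder from the third section is OR-ed back in above the frame's bit width.

    // Restore outliers after FOR decompression of the data section:
    // decoded[offset] already holds the low `bitWidth` bits of the outlier,
    // the exception array holds the remaining high bits.
    __global__ void patchExceptions(int *decoded, const int *offsets,
                                    const int *exceptions,
                                    int numExceptions, int bitWidth)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numExceptions)
            decoded[offsets[i]] |= exceptions[i] << bitWidth;
    }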
Data types
array of:
char
int
fixed-decimal numbers
FOR
Frame Of Reference
split the data into frames
determine the range of values for the frame (e.g. convert values into the interval {0,...,max-min})
write each value using the minimal number of bits needed for that range
Compression and decompression are highly efficient:
routines contain no branching conditions
functions are loop-unrolled
only bit-shift and mask operations are used

This implies that there are dedicated compression and decompression routines prepared for every bit encoding length.

Example:
{1,2,3,3,2,2,2,3,3,1,1,2,3,1,1}: we may use 2 bits to encode the values in the frame.
Compression:
for a data frame of length $n$, the loop processes $m$ values at a time (where $m$ is a multiple of 8), compressing them with the same function at each iteration step.
Decompression:
we iterate through the encoded values, $m$ at a time, using a function that decodes $m$ values.
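A generic, non-unrolled sketch of the packing step for one frame (the actual routines are branch-free and generated for every bit length; this plain loop only illustrates the shift-and-mask operations involved):

    // Pack n values, already mapped into {0,...,max-min}, using `bits` bits each.
    void forPack(const unsigned int *values, int n, int bits, unsigned char *out)
    {
        unsigned long long acc = 0;   // bit accumulator
        int filled = 0;               // number of valid bits in the accumulator
        int pos = 0;
        for (int i = 0; i < n; ++i) {
            acc |= (unsigned long long)values[i] << filled;  // append the next value
            filled += bits;
            while (filled >= 8) {                            // flush complete bytes
                out[pos++] = (unsigned char)(acc & 0xFF);
                acc >>= 8;
                filled -= 8;
            }
        }
        if (filled > 0)                                      // flush the last partial byte
            out[pos++] = (unsigned char)(acc & 0xFF);
    }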
FOR-DIFF
The FOR-DIFF algorithm is similar to FOR
FOR-DIFF can significantly improve the compression ratio for certain types of data
it stores the differences between successive data points in the frame
Compression:
calculate the differences
compress them using the FOR compression scheme
Decompression:
decompress using FOR
reconstruct the original data from the differences
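The two extra steps around FOR can be sketched as follows (a sequential version for clarity; on the GPU the reconstruction becomes the parallel prefix sum discussed earlier):

    // Before FOR compression: replace each value by its difference from the
    // previous one; data[0] keeps the first value of the frame.
    void diffEncode(int *data, int n)
    {
        for (int i = n - 1; i > 0; --i)
            data[i] -= data[i - 1];
    }

    // After FOR decompression: a running sum restores the original values.
    void diffDecode(int *data, int n)
    {
        for (int i = 1; i < n; ++i)
            data[i] += data[i - 1];
    }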
To be able to decompress a data frame, decompression functions must be specified.
in the case of the CPU one can use function pointers
on the GPU this depends on the CUDA compute capability (cc)
for cc lower than 2.x it is not possible to use function pointers (macros may be used instead)
for cc 2.x and higher some form of function pointers is available (in future work we will use this feature)
We used a macro-based solution which is compatible with both architectures:
this requires information about the compressed data at compile time
as the optimal compression parameters were found to be constant within particular data sets, this is not a serious limitation (see the sketch below)
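A hedged illustration of how such a macro-based selection can look when the bit length is a compile-time constant; all names here are invented for the example and do not reproduce the authors' actual macros:

    // Bit length of the compressed data, fixed for a particular data set.
    #define FRAME_BITS 6

    // Two-level token pasting so FRAME_BITS is expanded before concatenation:
    // DECOMPRESS resolves to the dedicated routine decompress_6.
    #define PASTE(name, bits)  name ## bits
    #define SELECT(name, bits) PASTE(name, bits)
    #define DECOMPRESS         SELECT(decompress_, FRAME_BITS)

    // Dedicated routine for 6-bit encoded frames (body elided in this sketch).
    __device__ void decompress_6(const unsigned char *src, int *dst, int n) { }

    __global__ void processFrame(const unsigned char *src, int *dst, int n)
    {
        DECOMPRESS(src, dst, n);   // expands to decompress_6(src, dst, n)
    }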