Leveraging Hadoop Framework to develop Duplication Detector


Priyanka Sethi

on 21 August 2014


Duplication Detection and Analysis in Hadoop Framework

2.5 quintillion bytes of data are generated each day.
90% of the data in the world today was created in the last two years alone.
This data comes from everywhere: sensors, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few.
This overwhelming deluge of data is difficult to manage using on-hand database management tools or traditional data processing applications – this is Big Data.
Big Data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, making businesses more agile and able to answer questions that were previously considered beyond reach.

What makes Hadoop unique is its simplified programming model, MapReduce.
MapReduce (MR) is a parallel programming technique proposed by Google for large-scale data processing in a distributed computing environment.
The computation takes a set of input key/value pairs and produces a set of output key/value pairs.
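The key/value model above can be illustrated with a minimal single-process simulation of the map, shuffle and reduce phases (a sketch in Python for clarity, not the actual distributed Hadoop implementation; the word-count example and all function names are illustrative):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit one intermediate (key, value) pair per word.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: combine all values that share the same key.
    yield word, sum(counts)

def run_mapreduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for k, v in inputs:                    # map phase
        for ik, iv in mapper(k, v):
            groups[ik].append(iv)          # shuffle: group values by key
    out = {}
    for ik, ivs in groups.items():         # reduce phase
        for ok, ov in reducer(ik, ivs):
            out[ok] = ov
    return out

result = run_mapreduce([("d1", "big data big"), ("d2", "data")], map_fn, reduce_fn)
# result == {"big": 2, "data": 2}
```

In the real framework the shuffle happens over the network between tasktrackers; here it is just a dictionary grouping.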

Data Deduplication
Much of the information in data repositories comprises redundant data.
Enterprises are rigorously moving towards different data-deduplication technologies.
Our intent is to reduce the cost of storing data at the file system level by achieving space savings without affecting the end-user experience, and to propagate these savings within the storage environment at the file system level; data deduplication is the appropriate technology for this.
But what if we want to retain certain copies intentionally and need selective (wishful) elimination?
Before embarking on an actual implementation, key points such as the sheer volume, variety, velocity, variability, content structure and complexity of the data, the migration process, and the need to provide timely search results should be properly accounted for.

Problem at hand…
As this torrent of digital datasets continues to outgrow datacenters, the focus needs to shift to stored-data reduction methods, particularly for NoSQL databases, since traditional structured storage systems continually face challenges in providing the throughput and capacity necessary to capture, store, manage and analyze the deluge of data.
Traditional databases are built for transactional workloads, high-speed analytics, interactive reporting and multi-step transactions, among other things.
Undoubtedly, they can be thought of as a nimble sports car handling rapid, interactive queries on small and moderate datasets.
But what if we wish to store vast amounts of data, run queries on huge, complex datasets and capture data streams at incredible speeds?
Hadoop proves to be a robust locomotive engine powering larger workloads: it takes on larger amounts of data and more complex queries.

Hadoop is a large-scale distributed batch processing infrastructure.
Its true power lies in its ability to scale to thousands of computers, each with several processor cores.
It is designed to efficiently distribute large amounts of work across a set of machines.
Rather than fetching data from different nodes to a common central location for processing, processing occurs at each node simultaneously, in parallel.
Hadoop thus achieves high data locality, resulting in high performance and making it the best choice for analysis and extraction of information from large datasets in a simple, cost-effective, manageable and scalable manner.

Converting our data files into large files
Ingest data files into HDFS [limit if required].
Convert them to large files to make them ideal candidates for Hadoop processing.
We achieve this by converting them to flat files called sequence files, which provide a persistent data structure for binary key/value pairs.
Sequence files are splittable.
Thereafter, depending on where our data reside and on bandwidth limitations, we can simply spawn standalone JVMs to create sequence files directly in HDFS.
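The idea of packing many small files into one large key/value container can be sketched as follows (a simplified Python illustration of length-prefixed binary key/value records; this is not the real Hadoop SequenceFile on-disk format, which adds headers, sync markers and compression):

```python
import io
import struct

def pack_records(records):
    """Pack (file name, file bytes) pairs into one length-prefixed
    binary stream, mimicking a sequence file's key/value layout."""
    buf = io.BytesIO()
    for name, data in records:
        key = name.encode("utf-8")
        buf.write(struct.pack(">I", len(key)))   # key length
        buf.write(key)                           # key: file name
        buf.write(struct.pack(">I", len(data)))  # value length
        buf.write(data)                          # value: file contents
    return buf.getvalue()

def unpack_records(blob):
    out, pos = [], 0
    while pos < len(blob):
        (klen,) = struct.unpack_from(">I", blob, pos); pos += 4
        key = blob[pos:pos + klen].decode("utf-8"); pos += klen
        (vlen,) = struct.unpack_from(">I", blob, pos); pos += 4
        out.append((key, blob[pos:pos + vlen])); pos += vlen
    return out

blob = pack_records([("a.txt", b"hello"), ("b.txt", b"world")])
assert unpack_records(blob) == [("a.txt", b"hello"), ("b.txt", b"world")]
```

Because every record is length-prefixed, a reader can skip record by record, which is what makes such a container splittable for parallel processing.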
Check for redundant files
For this, we use the cryptographic hash function MD5 to process the obtained sequence files; it produces a unique 128-bit (16-byte) hash value (fingerprint/signature) for identifying and comparing files to detect duplicates.
The MapReduce function is used intensively, and a key/value pair of the location and the message digest is stored.
The Reduce function takes the different values emitted by the mappers, shuffles them, and allots the proper namenode and task node for the reduction process.
The reduction process counts the number of common message digests with the same key and joins them into another list, which is then printed.
These fingerprints are small, so comparing them proves to be much quicker than comparing the files themselves.
There are currently over 700 million Facebook users, 250 million Twitter users and 156 million public blogs
Each such Facebook update, Tweet, blog post and comment creates multiple new data points, semi-structured and unstructured, sometimes called Data Exhaust.
Traditional data warehouses and other data management tools are not up to the job of processing and analyzing Big Data in a time or cost efficient manner. Therefore new ways of processing and analyzing Big Data are required.

Adept at storing large volumes of structured as well as semi-structured data.
Provides global access to files in a cluster.
Files in HDFS are divided into large blocks, typically 64 MB, and each block is stored as a separate file in the local filesystem.
Highly fault tolerant, designed to be deployed on low-cost commodity hardware.
Provides high throughput access to application data
Suitable for applications having large datasets

Avoid redundant files
If the same data gets processed through MD5 function multiple times, the same hash is generated each time.
Thus, if the newly generated fingerprint matches any of the existing ones, the data is deemed redundant and is not stored.
If the newly encountered fingerprint does not match any prior one, the data is stored and the hash index is updated with the new hash.
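The store-only-if-new logic above can be sketched as a small class keeping a hash index in memory (an illustrative Python sketch; the class and method names are hypothetical, and the real system maintains this index across the cluster rather than in one process):

```python
import hashlib

class DuplicationDetector:
    """Keep an index of MD5 fingerprints; accept a file only if its hash is new."""

    def __init__(self):
        self.index = set()    # previously seen fingerprints
        self.stored = []      # names of files actually stored

    def offer(self, name, data):
        h = hashlib.md5(data).hexdigest()
        if h in self.index:
            return False      # duplicate content: skip storage and transmission
        self.index.add(h)     # update the hash index with the new fingerprint
        self.stored.append(name)
        return True

det = DuplicationDetector()
assert det.offer("a.txt", b"payload") is True
assert det.offer("copy_of_a.txt", b"payload") is False   # same content, not stored
assert det.stored == ["a.txt"]
```

Running this check on the client side, before transmission, is what saves both storage space and network bandwidth.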

HDFS is implemented by two services: the NameNode and the DataNode.
The NameNode is responsible for maintaining the HDFS directory tree, and is a centralized service in the cluster operating on a single node.
The NameNode does not store HDFS data itself; rather, it maintains a mapping from each HDFS file name to the list of blocks in the file and the DataNode(s) on which those blocks are stored.
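That metadata mapping can be pictured as two lookup tables (a hypothetical in-memory sketch; the path, block IDs and node names are invented for illustration, and the real NameNode persists this state and tracks replicas dynamically):

```python
# file name -> ordered list of block IDs
namespace = {
    "/logs/2014/aug.seq": ["blk_1", "blk_2"],
}

# block ID -> DataNodes holding a replica of that block
block_locations = {
    "blk_1": ["datanode-1", "datanode-3"],
    "blk_2": ["datanode-2", "datanode-3"],
}

def locate(path):
    """Return, per block, the DataNodes a client could read from."""
    return [(b, block_locations[b]) for b in namespace[path]]

assert locate("/logs/2014/aug.seq")[0] == ("blk_1", ["datanode-1", "datanode-3"])
```

A client asks the NameNode only for this mapping, then reads the block data directly from the DataNodes, which keeps the central service lightweight.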

The user of the MapReduce library expresses the computation as two functions: Map and Reduce.
All map and reduce operations are tasks run on the tasktrackers in the Hadoop cluster.
These individual map and reduce tasks are monitored from inception to completion by the jobtracker.

The call of the time...
To design and develop a duplication detection system that identifies multiple copies of the same data as redundant at the file level itself, before transmission.
To analyze different datasets while varying the various performance and optimization enhancers (available in MapReduce, Hive and Pig), and to gain insights into which stage of the Hadoop framework one should work at and which optimization or performance enhancer to leverage at that stage.

Deduplication techniques can be classified into three broad categories, based on technology, process and type.
Based on the technology (how it is done):
a – Hash-based deduplication
    1 – Fixed-length (fixed block)
    2 – Variable-length (variable block)
b – Content- or application-aware deduplication
Based on the process (when it is done):
a – In-line deduplication
b – Post-process deduplication
Based on the type (where it happens):
a – Source (client-side) deduplication
b – Target-side deduplication

This work leverages the Hadoop framework to design and develop a duplication detection system that identifies multiple copies of the same data as redundant at the file level itself, before transmission, i.e. at the client side.
It helps in selective (wishful) elimination, and thereby in controlling the number of replicas that were traditionally duplicated.
It offloads processing power requirements from the target to the client nodes, reducing the amount of data sent over the network.
This proves extremely beneficial in bandwidth-constrained environments.

Convert data files into large files
Check for redundant files
Avoid/Remove redundant files
Compression proves to be an efficient technique for enhancing the performance of our duplication detector.
Sequence files prove to be ideal candidates to help us with compression.
There are three SequenceFile writers, based on the SequenceFile CompressionType used to compress key/value pairs:
Writer: uncompressed records.
RecordCompressWriter: record-compressed files; only the values are compressed.
BlockCompressWriter: block-compressed files; both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
File Block Size
The unit size of data exchange among all the nodes; it is configured as "dfs.block.size" in hdfs-site.xml.
The number of map tasks depends on the input dataset size and the block size; a larger block size results in fewer map tasks.
Input data size of one of our datasets = 95 GB:
If dfs.block.size = 64 MB, then minimum no. of maps = (95*1024)/64 = 1520 maps.
If dfs.block.size = 128 MB, then minimum no. of maps = (95*1024)/128 = 760 maps.
If dfs.block.size = 256 MB, then minimum no. of maps = (95*1024)/256 = 380 maps.
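The arithmetic behind these figures is simply input size divided by block size, rounded up, since each map task processes one HDFS block (a small Python sketch of the calculation; the function name is illustrative):

```python
import math

def min_map_tasks(input_gb, block_mb):
    # Each map task processes one HDFS block of the input,
    # so the minimum number of maps is ceil(input size / block size).
    return math.ceil(input_gb * 1024 / block_mb)

assert min_map_tasks(95, 64) == 1520
assert min_map_tasks(95, 128) == 760
assert min_map_tasks(95, 256) == 380
```

Doubling dfs.block.size halves the number of map tasks, trading per-task overhead against parallelism.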
This illustrates just one instance of performance tuning.
We need to balance all our resources (computation, I/O, network bandwidth, memory) across our clusters to achieve high performance.
Identify the resources that cause potential bottlenecks and address their limitations by understanding the configuration and requirements of the job, thereby achieving the right balance.

Thank You...

Priyanka Sethi
Assistant Professor
Hypothetically, any Hive or Pig job could be rewritten using MapReduce.
MR definitely proves faster when we do not leverage any of the optimization options/rules or performance enhancers.
MapReduce undoubtedly performed better for larger datasets, whereas for smaller ones Pig and Hive finished earlier.
Pig and Hive did not fare well for datasets containing a large proportion of image, audio and video files.

When we want to dive in deep, acquire fine-grained control over how data is processed, and work on a complex job with several primitives, highly unstructured datasets (images, videos, audio, log data, etc.), or semi-structured or ambiguously delimited datasets, we go for Java MapReduce.
With the level of flexibility provided by Hive and Pig, we can somehow achieve our goal by writing UDFs, changing query plans, etc., but it is not as smooth, convenient or efficient; they tend to spawn off a bunch of MR jobs where fewer would have sufficed.
On the other hand, Pig and Hive allow us to do the same with fewer lines of code, thereby reducing overall development and testing time. As such, they truly boost productivity for data scientists and engineers.

Hive and Pig…
Both generate MapReduce jobs from a query written in higher level language
Free users from knowing all the minor details of MapReduce and HDFS
Pig (procedural data-flow language, popular among programmers and researchers)
Hive (declarative SQL-like language, popular among analysts)