HADOOP
Published on 30 October 2013


Transcript of HADOOP

HADOOP
Big Data?
Why Hadoop?
What is Hadoop?
MapReduce
MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
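The map/sort/reduce flow described above can be sketched in plain Python. This is a toy word count for illustration only: the input strings, function names, and in-memory "shuffle" stand in for work the framework actually distributes across a cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map task: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Framework step: sort map output so equal keys form contiguous groups."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(key, values):
    """Reduce task: combine all values emitted for one key."""
    return key, sum(values)

chunks = ["big data big clusters", "big data"]  # the input, split into chunks
result = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(chunks)))
print(result)  # {'big': 3, 'clusters': 1, 'data': 2}
```

In real Hadoop each chunk is handled by a separate map task on a separate node, and the sorted groups are streamed to reduce tasks; the data flow is the same.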
What is Big Data?
Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
Why Is Big Data Important?
Big data may be as important to business, and to society, as the Internet has become. Why? More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiency, cost reductions, and reduced risk.
Problems Associated with Big Data
Big data means data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Headache!!!
The Ox Analogy
Computer Systems
Hadoop: a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop Distributed File System (HDFS): self-healing cluster storage with very high data throughput.
Properties
Architecture
HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop.
A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a:
  • JobTracker
  • TaskTracker
  • NameNode
  • DataNode
A slave or worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes.
The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible.
A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.
Every TaskTracker is configured with a set of slots; these indicate the number of tasks it can accept.
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
A DataNode stores data in the Hadoop file system (HDFS). A functional filesystem has more than one DataNode, with data replicated across them.
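The NameNode/DataNode split can be made concrete with a small sketch. This is a conceptual illustration, not Hadoop's real API: the toy block size, replication factor, and round-robin placement are stand-ins for HDFS's actual 64 MB blocks, default replication of 3, and rack-aware placement.

```python
# NameNode holds only metadata (file -> blocks); DataNodes hold block contents.
BLOCK_SIZE = 8    # toy value; HDFS blocks are tens of megabytes
REPLICATION = 2   # toy value; HDFS default replication is 3

namenode = {}                                   # file path -> list of block ids
datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}   # node -> {block id: bytes}

def put(path, data):
    """Split a file into blocks and replicate each block across DataNodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    namenode[path] = []
    nodes = sorted(datanodes)
    for n, block in enumerate(blocks):
        block_id = f"{path}#{n}"
        namenode[path].append(block_id)
        # place each block on REPLICATION distinct nodes, round robin
        for r in range(REPLICATION):
            datanodes[nodes[(n + r) % len(nodes)]][block_id] = block

def read(path):
    """Reassemble a file from any surviving replica of each block."""
    return "".join(
        next(d[b] for d in datanodes.values() if b in d)
        for b in namenode[path]
    )

put("/logs/a.txt", "hadoop stores files as replicated blocks")
print(read("/logs/a.txt"))  # the original string, reassembled from blocks
```

Because each block lives on more than one DataNode, losing a node does not lose data; the real NameNode additionally re-replicates under-replicated blocks, which is the "self-healing" behaviour mentioned above.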
Apache Hadoop Ecosystem

  • Hadoop Core: HDFS (storage) and MapReduce (processing)
  • Sqoop: Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (RDBMS)
  • Hive: SQL-like interface for HDFS
  • Flume: collects data from other sources into Hadoop
Hive

Data warehouse system for Hadoop. It facilitates:
  • easy data summarization
  • ad-hoc queries
  • analysis of large datasets stored in Hadoop-compatible file systems
Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
  • Pig: processing script engine
  • Mahout: machine learning framework
  • Oozie: workflow scheduler
HBase

HBase does not stand for "Hadoop Database". It is:
  • A "NoSQL" data store
  • Able to store very large amounts of data (gigabytes, terabytes in a table)
  • Scalable, with very high data throughput (approx. 100,000 inserts per second)
  • Very good for "sparse data": tables can have many thousands of columns, and empty cells do not occupy any space
  • Simple in its access model: insert, read, delete rows, partial or full table scans; only one column (the row key) is indexed
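The sparse, row-key-indexed model can be sketched as a dictionary of dictionaries. This is an illustration of HBase's data model only, not the HBase client API; the table name, row keys, and column names are made up for the example.

```python
# Each row key maps to a sparse set of column -> value entries,
# so a table with thousands of possible columns stores only the
# cells that were actually written.
table = {}  # row key -> {column: value}; only the row key is "indexed"

def put(row, column, value):
    table.setdefault(row, {})[column] = value

def get(row):
    """Read one row by its key (the only indexed access path)."""
    return table.get(row, {})

def scan(start, stop):
    """Partial table scan over the sorted row-key range [start, stop)."""
    return {r: cols for r, cols in sorted(table.items()) if start <= r < stop}

put("user#1001", "info:name", "Ada")
put("user#1001", "stats:logins", 42)
put("user#9999", "info:name", "Bob")   # any other columns simply stay absent

print(get("user#1001"))                # {'info:name': 'Ada', 'stats:logins': 42}
print(scan("user#1000", "user#2000"))  # only rows whose key falls in the range
```

Note that a query on any column other than the row key would require a full scan, which is exactly the trade-off the bullet list above describes.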
Hive Example: Simple Join

SELECT stock.product, SUM(orders.purchases)
FROM stock
INNER JOIN orders ON (stock.id = orders.stock_id)
WHERE YEAR(orders.order_date) = 2012
GROUP BY stock.product;

Result: a list of items with their quantity of sales.
Pig

High-level platform for creating MapReduce programs used with Hadoop. Its language is called Pig Latin.
Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to what SQL is for RDBMSs.
Properties:
  • Ease of programming
  • Optimization opportunities
  • Extensibility

Example
Input: User Profiles, Page Visits

Task: find the top 5 websites that have been visited by users aged 18 to 25 years.
Pig - Code

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';
Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. With Oozie, it is possible to combine different Hadoop jobs into one workflow (e.g. Sqoop, MapReduce, Hive, and Pig jobs).
Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
Mahout

  • Produces free implementations of distributed or otherwise scalable machine learning algorithms
  • Focused primarily on collaborative filtering, clustering, and classification
  • Usable directly in Hadoop (ready-made JARs)
  • Provides Java libraries for common math operations (focused on linear algebra and statistics) and for primitive Java collections
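To make "collaborative filtering" concrete, here is a tiny item-similarity computation of the kind Mahout scales out over a cluster. This is a hedged illustration, not Mahout's API: the ratings data and function names are invented, and the cosine similarity shown is just one of several similarity measures such libraries offer.

```python
from math import sqrt

# user -> {item: rating}; illustrative data only
ratings = {
    "u1": {"hadoop": 5, "hive": 4},
    "u2": {"hadoop": 4, "hive": 5, "pig": 2},
    "u3": {"pig": 5, "hive": 1},
}

def item_vector(item):
    """Collect every user's rating of one item into a sparse vector."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    return dot / (sqrt(sum(v * v for v in a.values())) *
                  sqrt(sum(v * v for v in b.values())))

# Items rated highly by the same users come out as similar,
# which is the basis for "users who liked X also liked Y".
sim = cosine(item_vector("hadoop"), item_vector("hive"))
print(round(sim, 3))
```

Mahout's contribution is running exactly this kind of pairwise computation as MapReduce jobs, so it works when the ratings matrix no longer fits on one machine.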
Flume

Distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
It is not restricted to log data aggregation: it can also transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages, and pretty much any data source possible.
Splitting Criteria
The Existing Splitting Criteria
The Onion Analogy
Problem Statement

Analyse the current splitting techniques used in Apache Hadoop, devise new splitting techniques, and compare them with each other and with the existing techniques on the algorithms of Association Rule Mining (ARM), in terms of efficiency across various types and sizes of data.