Hadoop Ecosystem

by Shen Li on 5 April 2016


Transcript of Hadoop Ecosystem

Hadoop Ecosystem
and
Cloud OS

design by Dóri Sirály for Prezi
What's Hadoop?
Limitations in Hadoop

http://geospatial.blogs.com/geospatial/2012/11/trends-in-data-analytics-for-utilities.html


GE is a registered trademark of General Electric Company
http://www-01.ibm.com/software/data/bigdata/industry.html
http://blog.schneider-electric.com/smart-grid/2013/03/18/the-impact-of-big-data-on-energy-and-sustainability/

Motivation? Application?
Wikipedia: Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
Apache: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Cloudera: The Hadoop platform was designed to solve problems where you have a lot of data
— perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables.
Why Hadoop?
Scale up (vertical) vs. scale out (horizontal)
How Does Hadoop Work?
Data Storage:
Data Processing:
Hadoop System Design
HBase
Hive, Pig
Oozie, WOHA
Yarn
The master node is a single point of failure and may become a performance bottleneck.
Lacks support for pipelined/coordinated data processing.
Simple queries take too much effort to issue.
Does not exploit structure in the data layout.
Hadoop Ecosystem
part of
HBase
An open-source implementation of Google's BigTable

Offers support for structured data

Logically a 3D table, but technically a key-value store:
(row:string, column:string, time:int64) → string
What are the benefits of this design (compared to row-based databases)?
Can be easily distributed.
May use space efficiently even when the table is sparse.
High read/write throughput.
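As a rough illustration of the data model above, the following Python sketch stores cells in a dictionary keyed by (row, column, timestamp). The class name `SparseTable`, its methods, and the sample rows are hypothetical and are not HBase's API; the sketch only shows that unwritten cells take no space and that a read can return the newest version of a cell.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy model of the (row, column, time) -> string mapping.
# This is not HBase's API; it only illustrates the key-value layout.
Key = Tuple[str, str, int]


class SparseTable:
    def __init__(self) -> None:
        self.cells: Dict[Key, str] = {}

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        # Only cells that are actually written occupy space.
        self.cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str) -> Optional[str]:
        # Return the value with the largest timestamp for this row/column.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None


table = SparseTable()
table.put("user1", "info:name", 1, "Ann")
table.put("user1", "info:name", 2, "Anna")
table.put("user2", "info:email", 1, "bob@example.com")
latest_name = table.get("user1", "info:name")  # newest version wins
missing = table.get("user2", "info:name")      # never written, so no storage used
```

Because every cell is addressed by a flat key, the key space can be split into ranges and spread over servers without changing how a cell is looked up.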
Hive
Supports SQL-like scripting language called HiveQL
It uses a relational database to store its metadata, but the data to be processed is stored in HDFS
HBase vs Hive
HBase improves the read/write throughput on structured data.
Hive translates SQL-like queries into Hadoop jobs.
It converts HiveQL queries into MapReduce Jobs
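To make the translation idea concrete, here is a minimal sketch of how a grouped count, roughly `SELECT dept, COUNT(*) FROM employees GROUP BY dept`, could be expressed as a map step and a reduce step. The table, the column names, and the helper functions are assumptions made for illustration; Hive's actual query planning is far more involved.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical rows of an "employees" table: (name, dept).
rows = [("ann", "eng"), ("bob", "ops"), ("cat", "eng"), ("dan", "eng")]


def map_phase(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    # Emit (group key, 1) for every row, like extracting the GROUP BY key.
    return [(dept, 1) for _name, dept in records]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all values that share a key, as the framework does between phases.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    # COUNT(*) per group is the sum of the emitted ones.
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(rows)))
```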
Oozie
It is a waste of effort to write a different controller for each workflow.
Oozie Cont.
How does Oozie define a pipeline?
Workflow.
Coordinator.
Bundle.
Workflow
Defined as a DAG of jobs.
Each job can be a Hadoop job, Java code, a Pig job, etc.
Decision conditions can be the number of files, file age, etc.
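The DAG idea can be sketched in a few lines of Python. The job names and dependencies below are hypothetical, and real Oozie workflows are written in XML rather than Python; the sketch only shows that a DAG of jobs gives a valid execution order in which each job runs after the jobs it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# static_order() yields every job after all of its dependencies.
execution_order = list(TopologicalSorter(workflow).static_order())
```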

Oozie Cont.
Coordinator
A wrapper around a workflow that allows Oozie to execute the workflow based on:
Time dependency (frequency)
Data dependency
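A minimal sketch of the two kinds of dependency, assuming a hypothetical `ready_to_run` check that requires both of them. The function, its parameters, and the dataset names are illustrative only and do not reflect Oozie's configuration format.

```python
from datetime import datetime, timedelta
from typing import Set


def ready_to_run(now: datetime, last_run: datetime, frequency: timedelta,
                 required_inputs: Set[str], available_inputs: Set[str]) -> bool:
    # Time dependency: enough time has passed since the previous run.
    time_ok = now - last_run >= frequency
    # Data dependency: every required input dataset is present.
    data_ok = required_inputs <= available_inputs
    return time_ok and data_ok


last = datetime(2016, 4, 5, 0, 0)
now = datetime(2016, 4, 5, 1, 0)
hourly = timedelta(hours=1)
waiting = ready_to_run(now, last, hourly, {"logs/00"}, set())
runnable = ready_to_run(now, last, hourly, {"logs/00"}, {"logs/00", "logs/01"})
```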

Bundle
Packs multiple Coordinators into one configuration file.
Simplifies job submission.

YARN Cont.
YARN
Yet Another Resource Negotiator
How can the burden on the Hadoop master node be offloaded?
Hadoop Master Node
offloads
offloads
Remaining Logic
Application Master
Resource Manager
Node Manager
Sits on top of HDFS
The logic of Hadoop Master node is offloaded into Application Masters (AM) and the Resource Manager (RM).
Much more flexible resource unit (the container)
Separates global resource management from application logic.
What does Microkernel OS offer?
low-level address space management,
thread management,
inter-process communication (IPC)
The Hadoop Ecosystem is approaching a Cloud OS!
Recap
How does Hadoop work? Can you give an example that demonstrates the main steps of its data processing?
What are the limitations of Hadoop?
What are the differences between HBase and Hive?
How does YARN compare to Hadoop v1?
How many servers may die before we lose data?
How to distribute a row-based database?
Map:
applies a given function element-wise to a list of elements and
returns a list of results
A Simple Example:
– L = (1, 2, 3, 4, 5);
– f : Multiply an element by two;
– Map(f, L) returns (2, 4, 6, 8, 10).
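The same example written in Python, using the built-in `map`. Because `f` is applied to each element independently, the work can be split across machines.

```python
# Map(f, L) from the example: apply f to each element independently.
L = [1, 2, 3, 4, 5]


def f(x: int) -> int:
    # Multiply an element by two.
    return x * 2


mapped = list(map(f, L))
```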
Reduce:
takes a combining function and a list of elements of some
data structure. Reduce then combines the elements of the data structure using the function in some systematic way.
A Simple Example:
– L = (1, 2, 3, 4, 5);
– f : add two elements;
– Reduce(f, L) applies f to L recursively and returns 15.
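The same example written in Python, using `functools.reduce`, which folds the list from left to right.

```python
from functools import reduce

L = [1, 2, 3, 4, 5]


def f(a: int, b: int) -> int:
    # Add two elements.
    return a + b


# reduce folds left to right: ((((1 + 2) + 3) + 4) + 5)
total = reduce(f, L)
```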
What about sorting using Hadoop?
Can you show an example that sorts the following list with 2 mappers and 2 reducers?

(4, 2, 3, 7, 9, 1, 8, 6, 5, 0)
Benefits?
Fault tolerance
Pricing
Costs?
Complicated software
networking
maintenance
M1 input: (4, 2, 3, 7, 9)
M2 input: (1, 8, 6, 5, 0)
M1 output: (2, "")(3, "")(4, "") to R1; (7, "")(9, "") to R2
M2 output: (0, "")(1, "") to R1; (5, "")(6, "")(8, "") to R2
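The two-mapper, two-reducer sort above can be simulated in plain Python. This is a sketch under stated assumptions: the split point of 5 (keys below 5 go to R1, the rest to R2) is inferred from the groups shown above, and the `mapper` and `reducer` functions are hypothetical stand-ins rather than Hadoop API calls.

```python
from typing import Dict, List

data = [4, 2, 3, 7, 9, 1, 8, 6, 5, 0]
splits = [data[:5], data[5:]]  # one input split per mapper
BOUNDARY = 5  # assumed split point: keys below 5 go to R1, the rest to R2


def mapper(split: List[int]) -> Dict[int, List[int]]:
    # Emit each number as a key and route it to a reducer by key range.
    out: Dict[int, List[int]] = {0: [], 1: []}
    for key in split:
        out[0 if key < BOUNDARY else 1].append(key)
    return out


def reducer(keys: List[int]) -> List[int]:
    # Stand-in for the framework sorting keys before they reach the reducer.
    return sorted(keys)


mapped = [mapper(s) for s in splits]
r1 = reducer(mapped[0][0] + mapped[1][0])
r2 = reducer(mapped[0][1] + mapped[1][1])
sorted_output = r1 + r2  # concatenating reducer outputs gives a global order
```

Range partitioning is what makes the concatenation valid: every key sent to R1 is smaller than every key sent to R2.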
Benefits?
Is there any problem with "Sharding"?
What if some shard grows too large and some shrinks too small?
What if you would like to add 10 more servers?
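One way to see the cost of adding servers is to count how many keys change shards. The sketch below assumes the simplest placement rule, key modulo the number of servers, and assumed cluster sizes of 10 servers growing to 20; neither is specified in the slides.

```python
def shard_for(key: int, num_servers: int) -> int:
    # Simplest placement rule: key modulo the number of servers.
    return key % num_servers


keys = range(1000)
before, after = 10, 20  # assumed cluster sizes: 10 servers, then 10 more
moved = sum(1 for k in keys if shard_for(k, before) != shard_for(k, after))
fraction_moved = moved / len(keys)
```

Under this rule, half of the keys would have to be moved to a different server, which is one reason static sharding makes growing a cluster expensive.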
Hive requires Hadoop-HDFS and Hadoop-MapReduce, while HBase only needs HDFS.
Typical use case of Hadoop?
Perform data sampling before submitting the sorting job.
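A minimal sketch of why sampling helps, assuming the goal is to pick reducer split points that balance the load. The function names, the sample size, and the fixed random seeds are hypothetical choices for illustration and are not taken from any particular Hadoop sort implementation.

```python
import bisect
import random
from typing import List


def choose_boundaries(data: List[int], num_reducers: int,
                      sample_size: int, seed: int = 0) -> List[int]:
    # Sort a random sample and take evenly spaced keys as split points.
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, sample_size))
    step = sample_size / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def partition(key: int, boundaries: List[int]) -> int:
    # Reducer i receives keys in [boundaries[i-1], boundaries[i]).
    return bisect.bisect_right(boundaries, key)


data = list(range(10_000))
random.Random(1).shuffle(data)
boundaries = choose_boundaries(data, num_reducers=4, sample_size=200)
sizes = [0] * 4
for key in data:
    sizes[partition(key, boundaries)] += 1
```

If the sample is representative, the entries of `sizes` come out roughly equal, so no single reducer receives most of the data.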
http://sortbenchmark.org/YahooHadoop.pdf
presented by Shen Li
http://web.engr.illinois.edu/~shenli3/
http://web.engr.illinois.edu/~shenli3/papers/woha.pdf
Scheduling in Hadoop Cluster
Default FIFO Scheduler
Facebook Fair Scheduler
Resource Pools