EECE 417

Hadoop Distributed File System
by Jack Wu on 23 December 2014

Transcript of EECE 417

Hadoop Distributed File System
Jack Wu
Catherine Wang
EECE 417
Son, what's your stuffed elephant's name?
It's HADOOP!
What is Hadoop?
Too many files!
Organized into clusters
Each cluster has a namenode and at least one datanode
Putting commodity cores, disks, and memories back into a distributed network
Indexing
Each file block is replicated across the datanodes
BIG data storage
MapReduce
Hadoop provides a framework that facilitates the MapReduce paradigm through a pipeline architecture and data allocation
Architecture highlights
Replicated blocks
Periodic block reports
Central metadata server
Avoids a complicated locking system by keeping metadata in memory
Advantages
Scalable & Economic
Fault-tolerant
Extensible
Distributed
Executes the reduce phase
Executes the mapper phase
Sends commands to TaskTrackers
Find me the word 'Hello'!
I found 'Hello' 3 times
I found 'Hello' 2 times
Thanks guys! Let me put 'em all together
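The 'Hello' exchange above can be sketched as a tiny map and reduce in plain Python. This is only an illustration of the paradigm, not Hadoop's API: the function names map_phase and reduce_phase and the two sample splits are assumptions made up for this example.

```python
from collections import Counter

def map_phase(split, word):
    """Mapper: count how often `word` appears in one input split."""
    tokens = (token.strip(".,!?'\"") for token in split.split())
    return (word, sum(1 for token in tokens if token == word))

def reduce_phase(pairs):
    """Reducer: add up the partial counts emitted by all mappers."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Hypothetical input splits, one per worker.
splits = [
    "Hello world. Hello Hadoop! Hello elephant",
    "Hello again, and Hello once more",
]

partials = [map_phase(split, "Hello") for split in splits]
result = reduce_phase(partials)
print(partials)  # [('Hello', 3), ('Hello', 2)]
print(result)    # {'Hello': 5}
```

Each mapper only sees its own split, so the map calls could run on different machines; only the small (word, count) pairs travel to the reducer.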
Backup Node
Keeps track of where file blocks are located on datanodes
Delegates tasks to datanodes
Holds file blocks and block replicas
Multiple datanodes execute one job's tasks in parallel
Get to work!
EditLog
FsImage
Periodic reports from datanodes to the namenode
Writing to memory ......
Writing to memory ......
Writing to memory ......
Done!
Creating new checkpoint ......
Updating FsImage ......
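The EditLog and FsImage sequence above can be sketched roughly as follows. This is a toy model under stated assumptions, not HDFS code: the class ToyNameNode, its method names, and the block IDs are hypothetical, and a real namenode persists both structures to disk.

```python
class ToyNameNode:
    """Toy model: in-memory namespace, an edit log, and a checkpoint image."""

    def __init__(self):
        self.fs_image = {}    # last checkpoint: path -> list of block ids
        self.edit_log = []    # operations recorded since the last checkpoint
        self.namespace = {}   # live in-memory metadata

    def create(self, path, blocks):
        # Record the change in the edit log, then apply it in memory.
        self.edit_log.append(("create", path, list(blocks)))
        self.namespace[path] = list(blocks)

    def delete(self, path):
        self.edit_log.append(("delete", path, None))
        self.namespace.pop(path, None)

    def checkpoint(self):
        """Replay the edit log onto the old image, then empty the log."""
        image = dict(self.fs_image)
        for op, path, blocks in self.edit_log:
            if op == "create":
                image[path] = blocks
            elif op == "delete":
                image.pop(path, None)
        self.fs_image = image
        self.edit_log = []
        return image

nn = ToyNameNode()
nn.create("/a.txt", ["blk_1", "blk_2"])
nn.create("/b.txt", ["blk_3"])
nn.delete("/a.txt")
image = nn.checkpoint()
print(image)  # {'/b.txt': ['blk_3']}
```

After the checkpoint the image matches the in-memory namespace and the edit log is empty, so a restart would only need to replay edits made after this point.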
65,000,000
40,000
Who has what I want?
They all have it
You are closest to me. Give me block A
Here is what you asked for
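The read exchange above, where the client asks for the closest replica, might look like this in outline. The "/rack/node" location strings, the function names, and the distance values (0 for the same node, 2 for the same rack, 4 otherwise) are assumptions for illustration, not a statement of how HDFS computes distance.

```python
def distance(a, b):
    """Toy topological distance between two '/rack/node' locations."""
    if a == b:
        return 0                      # same node
    rack_a = a.rsplit("/", 1)[0]
    rack_b = b.rsplit("/", 1)[0]
    return 2 if rack_a == rack_b else 4  # same rack vs. different rack

def closest_replica(client, replicas):
    """Pick the datanode holding the block that is nearest to the client."""
    return min(replicas, key=lambda node: distance(client, node))

# Hypothetical cluster: block A is stored on three datanodes.
client = "/rack1/node1"
replicas_of_block_a = ["/rack2/node5", "/rack1/node3", "/rack2/node6"]

chosen = closest_replica(client, replicas_of_block_a)
print(chosen)  # /rack1/node3
```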
HDFS client write
Who will keep my file?
Give it to them, each of them will have one copy
Here is my file, pass it please!
By default, the replication factor is three
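A minimal sketch of the write exchange above, assuming the pipeline is simply the three datanodes nearest to the client. The names build_pipeline and write_block, the "/rack/node" locations, and the distance values are hypothetical, and the real placement policy considers more than distance alone.

```python
REPLICATION_FACTOR = 3  # default replication factor stated on the slide

def rack_distance(a, b):
    """Toy distance: 0 same node, 2 same rack, 4 different rack."""
    if a == b:
        return 0
    return 2 if a.rsplit("/", 1)[0] == b.rsplit("/", 1)[0] else 4

def build_pipeline(client, datanodes, replication=REPLICATION_FACTOR):
    """Order datanodes by distance from the client and keep the nearest ones."""
    ordered = sorted(datanodes, key=lambda node: rack_distance(client, node))
    return ordered[:replication]

def write_block(block, pipeline, storage):
    """Simulate the hand-off: each node stores the block, then passes it on."""
    for node in pipeline:
        storage.setdefault(node, []).append(block)
    return len(pipeline)

client = "/rack1/node1"
datanodes = ["/rack2/node4", "/rack1/node2", "/rack1/node1", "/rack2/node5"]

pipeline = build_pipeline(client, datanodes)
storage = {}
copies = write_block("blk_A", pipeline, storage)
print(pipeline)  # ['/rack1/node1', '/rack1/node2', '/rack2/node4']
print(copies)    # 3
```

The client only sends the block to the first node in the pipeline; the remaining copies are made by the datanodes forwarding it along the chain.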
How does the master/slave model tie into Hadoop?
What would happen if the namenode fails?
a. Allow a backup node to restore the namenode
b. Halt entire system, manual namenode replacement
c. Datanodes can take over the tasks of a namenode
Decommission node and stop sending requests
What happens if the connection is severed between namenode and datanode?
What happens if there is a corrupted file block on a datanode?
a. Keep sending block requests to the datanode until an exception occurs
b. No more requests are sent, because no block report was received
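The idea behind option b, that a namenode stops routing requests to a datanode once its reports stop arriving, can be sketched like this. The class ReportTracker, the 30-second timeout, and the node names are all assumptions chosen for the example; they are not HDFS defaults.

```python
class ReportTracker:
    """Toy namenode-side bookkeeping of the last report from each datanode."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_report = {}  # datanode name -> time of last report (seconds)

    def record_report(self, node, now):
        self.last_report[node] = now

    def live_nodes(self, now):
        """Datanodes whose last report is recent enough to receive requests."""
        return sorted(
            node
            for node, seen in self.last_report.items()
            if now - seen <= self.timeout_s
        )

tracker = ReportTracker(timeout_s=30)   # hypothetical timeout
tracker.record_report("dn1", now=0)
tracker.record_report("dn2", now=0)
tracker.record_report("dn1", now=25)    # dn2's connection is severed after t=0

live = tracker.live_nodes(now=40)
print(live)  # ['dn1']
```

Explicit timestamps are passed in instead of reading the clock so the example stays deterministic.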
add slide for fs comparison
centralized vs distributed
Data Received
Data is written to the datanodes based on topological distance
Reading ......
Map Phase
Reduce Phase
MapReduce Framework
V.S.
Name Node
Data Node
A block is over-replicated if it is in high demand
Datanodes are pipelined based on the topological distance from client
What happens if there are multiple writes to the same file?
Centralized FS Server