What is Hadoop ?

by Narendra Babu, 13 March 2014



What is Hadoop ?






THANK YOU

Hadoop Architecture Flow

HDFS Architecture

Job Tracker (Master/Scheduler):

Manages computation processing (MapReduce jobs)
Distributes tasks across the HDFS cluster (data nodes)
Tracks map and reduce tasks
Manages task failures - restarts failed tasks on different nodes
Runs speculative execution for slow tasks

Task Tracker (Slave/Task Execution):

Creates individual map and reduce tasks
Tracks individual map and reduce tasks
Reports task status/progress to the Job Tracker

Job Tracker / Task Tracker
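The failure handling described above can be sketched in a few lines. This is a toy single-process simulation, not Hadoop code: the task and node names are invented, and the real Job Tracker uses heartbeats and a far richer scheduler. It only illustrates the idea of reassigning a failed task to a different node.

```python
# Toy sketch (not Hadoop code): a scheduler that restarts failed map
# tasks on a different node, mimicking the Job Tracker's failure handling.

def schedule(tasks, nodes, failures):
    """Assign each task to a node, skipping nodes where it already failed.

    `failures` maps task -> set of nodes on which that task has failed.
    """
    assignment = {}
    for i, task in enumerate(tasks):
        # Round-robin over the nodes this task has not yet failed on.
        candidates = [n for n in nodes if n not in failures.get(task, set())]
        assignment[task] = candidates[i % len(candidates)]
    return assignment

tasks = ["map-0", "map-1", "map-2"]
nodes = ["node-a", "node-b", "node-c"]

first = schedule(tasks, nodes, failures={})
# Suppose map-1 fails on its assigned node; rescheduling avoids that node.
second = schedule(tasks, nodes, failures={"map-1": {first["map-1"]}})
```

After the simulated failure, `map-1` lands on a different node than before, which is exactly the restart behavior the slide describes.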


Runs on all slave nodes in the cluster

Clients access blocks directly from the data nodes

Handles block creation/replication/deletion/reads

Each block is stored as two files - the first contains the data itself

The second contains block metadata (checksums, creation time stamp)

Startup process and registration (handshake with the Namenode)

Periodically sends heartbeats and block reports to the Namenode

Receives instructions from the Namenode

DataNode – Block Storage
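The two-file layout above can be made concrete with a small sketch. This is not HDFS's on-disk format (which is binary and chunked); it is only a toy illustration of keeping data and checksum metadata separately and verifying the checksum on read, as a data node does.

```python
# Toy sketch (not HDFS internals): a block stored as two parts, mirroring
# the slide - one holding the data, one holding metadata
# (checksum, creation time stamp).

import time
import zlib

def write_block(data: bytes):
    """Return the (data, metadata) pair a data node might keep per block."""
    meta = {
        "checksum": zlib.crc32(data),  # HDFS also uses CRC32-based checksums
        "created": time.time(),        # creation time stamp
    }
    return data, meta

def read_block(data: bytes, meta: dict) -> bytes:
    """Verify the checksum before handing the data to a client."""
    if zlib.crc32(data) != meta["checksum"]:
        raise IOError("block is corrupt")
    return data

data, meta = write_block(b"hello hdfs")
assert read_block(data, meta) == b"hello hdfs"
```

A corrupted block fails the checksum test on read; in real HDFS the client would then fetch another replica, and the Namenode would re-replicate the block.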

Namenode – Secondary Namenode interaction


Secondary Namenode (Checkpoint Node/Backup Node):

Maintains a copy/snapshot of the Namenode metadata

Fetches edit logs from the Namenode at regular intervals and applies them to the fsimage

Once the new fsimage is created, it is copied back to the Namenode

The Namenode uses this fsimage at its next restart, which reduces startup time

Helps minimize downtime/data loss if the Namenode fails

Not a failover server for the Namenode


Secondary Namenode
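The checkpoint step described above, applying the edit log to the fsimage, can be sketched with a dict standing in for the namespace. The record names and the dict-based "fsimage" are invented for illustration; the real fsimage and edit-log formats are binary files.

```python
# Toy sketch of a checkpoint: merge a batch of edit-log records into a
# copy of the fsimage (here just a dict of path -> metadata), producing
# the new fsimage that is shipped back to the Namenode.

def checkpoint(fsimage: dict, edit_log: list) -> dict:
    """Apply edit-log operations to a snapshot of the namespace."""
    image = dict(fsimage)  # work on a copy, like the secondary node does
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = {"blocks": args[0]}
        elif op == "rename":
            image[args[0]] = image.pop(path)
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/a": {"blocks": ["blk_1"]}}
edits = [("create", "/b", ["blk_2"]), ("rename", "/a", "/a2")]
new_image = checkpoint(fsimage, edits)
```

Because the heavy merge happens on the secondary node, the Namenode restarts from a recent fsimage plus only a short tail of edits, which is why startup time drops.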

Components of Hadoop Cluster


Fault tolerant:
Stores/replicates files as blocks across many nodes in a cluster for durability

Self-healing: rebalances files across the cluster

Scalable:
Grows just by adding new nodes

Moves code/computation to the data

Master/slave architecture

No file updates

Write once, read many times

Large blocks, sequential read patterns






Hadoop Distributed File System (HDFS)
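The "stores files in blocks across many nodes" point above can be sketched as two small functions. This is an invented toy, not the real HDFS client API: it splits a byte string into fixed-size blocks and places three replicas of each block on distinct nodes with a simple round-robin, where real HDFS uses rack-aware placement.

```python
# Toy sketch: split a file into fixed-size blocks and place 3 replicas
# of each block on distinct nodes (round-robin; HDFS is rack-aware).

def split_into_blocks(data: bytes, block_size: int):
    """Chop the data into block_size pieces (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = 3):
    """Map each block index to `replication` distinct nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 300, block_size=128)   # 3 blocks
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
```

Losing any single node still leaves two replicas of every block, which is the fault tolerance the slide refers to.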


25,000 machines

More than 10 clusters

3 PB of data (compressed, unreplicated)

700+ users

10,000+ jobs per week

What’s Huge

Hadoop Environment

RDBMS vs Hadoop

The data has many valuable applications:

Marketing analysis
Product recommendations
Demand forecasting
Fraud detection
And many, many more...


We must process it to extract that value

Why Hadoop ?


And we are generating data faster than ever

Automation
Ubiquitous internet connectivity
User generated content

For example:

Twitter processes 340 million messages
Amazon S3 storage adds more than one billion objects
Facebook users generate 2.7 billion comments and likes

Why Hadoop ?



Why Hadoop ?

What is Hadoop ?

RDBMS vs Hadoop

Hadoop Environment/Ecosystem

HDFS – Hadoop Distributed File System

Hadoop components/daemons
Name Node, Secondary Name Node
Job Tracker, Task Tracker and Data Node

Writing/Reading files to HDFS




Agenda


Narendra Babu Chella







Hadoop Distributed File System (HDFS)




Client -> Job Tracker interaction

Namenode

Namenode (Metadata server):

Runs on a single node as a master process

The bookkeeper for HDFS - a critical part of HDFS

Maintains metadata for the cluster (which blocks are where)

Inode data:
Manages the file system namespace and file I/O operations (creating/opening/closing/renaming files/directories)
The HDFS namespace is a hierarchy of files/directories, represented on the Namenode by inodes

Block management:
Maps file blocks (inodes) to DataNodes (the physical location of file data)
Tracks/manages blocks on the data nodes

Monitors data node health

Re-replicates missing blocks

Keeps all namespace metadata in memory

Provides client access to files in HDFS

Writes modifications to the file system (edit logs)

At startup, reads the fsimage and merges in the edit logs

Authorization and authentication

Doesn't store data or run jobs



Namenode
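The two kinds of metadata listed above, the file namespace and the block map, can be made concrete with a toy data model. All paths, block IDs, and node names here are invented; this only illustrates what the Namenode hands back to a client.

```python
# Toy sketch of Namenode metadata: the namespace (path -> block IDs)
# and the block map (block ID -> data nodes holding a replica).

namespace = {                      # inode data: which blocks make up a file
    "/logs/web.log": ["blk_7", "blk_8"],
}
block_map = {                      # block management: where each block lives
    "blk_7": ["node-a", "node-b", "node-c"],
    "blk_8": ["node-b", "node-c", "node-d"],
}

def locate(path: str):
    """What the Namenode returns to a client: block IDs and locations.

    The client then reads the blocks directly from the data nodes -
    file data itself never flows through the Namenode.
    """
    return [(blk, block_map[blk]) for blk in namespace[path]]

locations = locate("/logs/web.log")
```

This also shows why the metadata fits in memory on one machine even for petabyte clusters: the Namenode stores only names and locations, never the data.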

Hadoop Architecture



HDFS (data storage)
Name Node / Secondary Name Node
Data Node

MapReduce (data processing/computation)
Job Tracker
Task Tracker

Master nodes:
Name Node
Secondary Name Node
Job Tracker

Slave nodes:
Data Node / Task Tracker

Hadoop Components /Daemons


Yahoo – spam detection, search indexing

Google – indexing, personalization

Facebook – like analysis, recommendations, personalization

Twitter – tweet analysis

Amazon – user experience, personalization

NY Times – news feed analysis

Veoh – video preferences


Who uses Hadoop

Hadoop Ecosystem

Hadoop is a platform

Fault tolerant (stores 3 copies of each block of data)

Batch/offline oriented, data and I/O intensive

Distributes and replicates data - HDFS

Manages parallel tasks created by users - MapReduce

Moves computation to the data, not the other way around

Written in Java, so it runs on Linux, Windows, Solaris, and Mac OS X




What is Hadoop ?
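The MapReduce model that Hadoop manages can be sketched with the classic word-count example: map emits (word, 1) pairs, a shuffle groups them by key, and reduce sums each group. Real Hadoop jobs are written in Java against the MapReduce API and run distributed; this is a single-process Python illustration of the flow only.

```python
# Toy word count showing the map -> shuffle -> reduce flow that Hadoop
# runs in parallel across a cluster.

from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

Because each map call touches one line and each reduce call touches one key's group, the framework can spread both phases over many nodes, which is the "manages parallel tasks" point above.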


Handles unstructured, semi-structured, and structured data

Handles enormous data volumes
Petabytes


Flexible data analysis and machine learning

Commodity hardware -
Cost-effective scalability


Why Hadoop ?

We’re generating more data than ever

Financial transactions
Sensor networks
Server logs
Analytics
Email and text messages
Social media

Data generated from different device form factors (PCs, smartphones, tablets)

Why Hadoop ?

Hadoop -
An open-source Apache project
A framework for large-scale data processing
A framework for reliably storing and processing petabytes of data using commodity hardware and storage
A cost-effective, scalable way to:
Store massive data sets (terabytes/petabytes)
Perform arbitrary parallel analyses on those data sets


The kernel of a distributed operating system for big data


A scalable solution for:
Large-scale data processing/analysis - computation capacity
Large-scale data storage
I/O bandwidth
Separates distributed-system fault-tolerance code from application logic

Core Components

Hadoop Distributed File System - distributes data
MapReduce - distributes application processing and control
