Transcript of HADOOP
Big Data?
What is Hadoop?
A software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Typically, both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
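The map → sort → reduce flow described above can be sketched in plain Python. This is a toy single-process simulation of a word-count job, not the Hadoop API; the function names are my own.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(word, counts):
    # Reduce: combine all counts emitted for one key.
    return (word, sum(counts))

def run_job(chunks):
    # Each chunk is an independent input split; in Hadoop the map
    # tasks over these chunks would run in parallel.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_phase(chunk))
    # The framework sorts map output by key before handing it to reduce.
    intermediate.sort(key=itemgetter(0))
    return dict(
        reduce_phase(word, (c for _, c in pairs))
        for word, pairs in groupby(intermediate, key=itemgetter(0))
    )

counts = run_job(["big data big", "data big deal"])
# counts == {"big": 3, "data": 2, "deal": 1}
```

The two "splits" here stand in for HDFS blocks; the sort between the phases is what Hadoop's shuffle performs across the cluster.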
What is Big Data?
Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
Why Is Big Data Important?
Big data may be as important to business – and society – as the Internet has become. Why? More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiencies, cost reductions, and reduced risk.
Problems Associated with Big Data
Big data refers to data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
The Ox Analogy
Hadoop: a free, Java-based programming framework
Supports the processing of large data sets in a distributed computing environment.
Part of the Apache project sponsored by the Apache Software Foundation.
Hadoop Distributed File System (HDFS): self-healing cluster storage with very high data throughput.
HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop.
A small Hadoop cluster will include a single master and multiple worker nodes.
The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode.
A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes.
The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible.
A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.
Every TaskTracker is configured with a set of slots; these indicate the number of tasks it can accept.
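The slot-and-locality behaviour of the JobTracker can be sketched as a toy assignment loop. This is a simplified model of my own, not Hadoop's scheduler: each task knows which node holds its data, and the "JobTracker" prefers that node's free slots before falling back to any other node.

```python
def assign_tasks(tasks, trackers):
    """Toy JobTracker: hand each task to a free slot, preferring the
    node that already holds the task's input data (data locality).
    tasks:    list of (task_id, data_node) pairs
    trackers: dict of node_name -> number of free slots
    """
    assignment = {}
    for task_id, data_node in tasks:
        if trackers.get(data_node, 0) > 0:
            chosen = data_node               # local slot available
        else:
            chosen = max(trackers, key=trackers.get)  # any free node
            if trackers[chosen] == 0:
                raise RuntimeError("no free slots in the cluster")
        trackers[chosen] -= 1                # occupy one slot
        assignment[task_id] = chosen
    return assignment

plan = assign_tasks(
    [("m1", "node-a"), ("m2", "node-a"), ("m3", "node-b")],
    {"node-a": 1, "node-b": 2},
)
# plan == {"m1": "node-a", "m2": "node-b", "m3": "node-b"}
```

Note how `m2` loses locality: `node-a` has only one slot, so the second map over `node-a`'s data is pushed to `node-b` rather than left waiting.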
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
A DataNode stores data in the Hadoop file system. A functional file system has more than one DataNode, with data replicated across them.
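The split between NameNode metadata and DataNode storage can be illustrated with a toy bookkeeping function. This sketch of mine only round-robins replicas; real HDFS placement also considers racks and node health.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Toy NameNode bookkeeping: split a file into fixed-size blocks
    and record, per block, which DataNodes hold a replica. The
    NameNode keeps only this table -- never the block data itself."""
    n_blocks = -(-file_size // block_size)   # ceiling division
    table = {}
    for b in range(n_blocks):
        # Pick `replication` distinct nodes, rotated by block index
        # so replicas spread across the cluster.
        table[b] = [datanodes[(b + i) % len(datanodes)]
                    for i in range(replication)]
    return table

layout = place_blocks(file_size=300, block_size=128,
                      datanodes=["dn1", "dn2", "dn3", "dn4"])
# 300 bytes at 128 bytes/block -> 3 blocks, each on 3 distinct DataNodes
```

Reading a file then means asking the NameNode for this table and fetching each block from any one of its replica DataNodes.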
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
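A typical Sqoop session might look like the following. The JDBC URL, table names, and paths are placeholders, and running this requires a configured Sqoop and cluster, so treat it as an illustrative sketch.

```shell
# Import a relational table into HDFS (connection details are hypothetical).
sqoop import \
  --connect jdbc:mysql://db.example.com/shop \
  --table ORDERS \
  --target-dir /user/hadoop/orders

# Export processed HDFS data back into a relational table.
sqoop export \
  --connect jdbc:mysql://db.example.com/shop \
  --table ORDER_SUMMARY \
  --export-dir /user/hadoop/summary
```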
Hive: data warehouse system for Hadoop
SQL-like interface for HDFS; queries compile down to Map & Reduce jobs
Easy data summarization
Analysis of large datasets stored in Hadoop-compatible file systems.
Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Pig: processing script engine
Mahout: machine learning framework
Oozie: workflow scheduler
HBase stands for "Hadoop Database"
A "NoSQL" data store
Can store very large amounts of data (gigabytes, terabytes in a table)
Scales with very high data throughput (approx. 100,000 inserts/second)
Is very good at handling sparse data
Tables can have many thousands of columns; zero values do not occupy any space
Has a simple access model:
Insert, read, delete rows, partial or full table scans
Only one column (the 'row' key) is indexed
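The access model above can be mimicked with a tiny in-memory class. This is a conceptual toy of my own, not the HBase client API: only non-empty cells are stored (so sparse rows cost nothing), and only the row key is ordered and scannable.

```python
class ToySparseTable:
    """Toy model of HBase's storage/access model."""

    def __init__(self):
        self.rows = {}  # row_key -> {column: value}; empty cells absent

    def insert(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def read(self, row_key):
        return self.rows.get(row_key, {})

    def delete(self, row_key):
        self.rows.pop(row_key, None)

    def scan(self, start=None, stop=None):
        # Partial scan over the sorted row-key range [start, stop);
        # with no bounds this is a full table scan.
        for key in sorted(self.rows):
            if (start is None or key >= start) and \
               (stop is None or key < stop):
                yield key, self.rows[key]

t = ToySparseTable()
t.insert("row1", "cf:name", "widget")
t.insert("row2", "cf:qty", 7)    # stores 1 cell, not thousands of empties
t.insert("row3", "cf:name", "gadget")
hits = dict(t.scan(start="row1", stop="row3"))
```

Note there is no index on `cf:name` or `cf:qty` — finding rows by a column value would require a full scan, exactly the trade-off the slide describes.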
Example query: list items with their quantity of sales in 2012, joining stock and orders (stock.id = orders.stock_id) and filtering on YEAR(orders.order_date) = 2012.
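In HiveQL, a query matching those fragments might look roughly like this. Column names beyond the ones given (`name`, `quantity`) are assumptions for illustration.

```sql
-- Hypothetical HiveQL sketch: items with their 2012 sales quantity.
SELECT stock.name, SUM(orders.quantity) AS total_sold
FROM stock
JOIN orders ON (stock.id = orders.stock_id)
WHERE YEAR(orders.order_date) = 2012
GROUP BY stock.name;
```

Hive would compile this join-and-aggregate into one or more MapReduce jobs behind the scenes.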
High-level platform for creating MapReduce programs that run on Hadoop.
The language used is Pig Latin.
Pig Latin abstracts the programming from the Java MapReduce idiom into a high-level notation, similar to what SQL provides for RDBMSs.
Ease of programming
Input: User Profiles, Page Visits
Find the top 5 websites that have been visited by users aged 18 to 25 years
age >= 18 and age <= 25;
Pig - Code
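The "top 5 sites visited by users aged 18 to 25" task above could be written in Pig Latin roughly as follows. File names and field layouts are assumptions; this mirrors the well-known example from the Pig documentation.

```pig
-- Hypothetical Pig Latin sketch of the top-5-sites task.
Users = LOAD 'users' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Jnd   = JOIN Fltrd BY name, Pages BY user;
Grpd  = GROUP Jnd BY url;
Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
Srtd  = ORDER Smmd BY clicks DESC;
Top5  = LIMIT Srtd 5;
STORE Top5 INTO 'top5sites';
```

Each statement names an intermediate relation; Pig compiles the whole chain into a sequence of MapReduce jobs.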
With Oozie, it is possible to combine different Hadoop jobs in a workflow (e.g. Sqoop, MR, Hive, and Pig jobs).
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
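A minimal Oozie workflow definition is an XML DAG of actions. The sketch below runs a single Pig script and then ends; the workflow name, script name, and schema version are illustrative, not taken from a real deployment.

```xml
<!-- Hypothetical minimal workflow: one Pig action, then done. -->
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="pig-step"/>
  <action name="pig-step">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>top5sites.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig step failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Larger workflows chain more `<action>` nodes through their `<ok to="..."/>` transitions, which is what makes the job graph a DAG.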
Usable directly in Hadoop (ready-made JARs)
Produces free implementations of distributed or otherwise scalable machine learning algorithms
Focused primarily on the areas of collaborative filtering, clustering, and classification
Provides Java libraries for common math (focused on linear algebra and statistics) operations and primitive Java collections.
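The collaborative-filtering idea Mahout scales up can be shown in miniature. This is a toy co-occurrence recommender of my own in Python, not Mahout code: items that frequently appear alongside a user's existing items get recommended.

```python
from collections import Counter

def recommend(history, user_items, top_n=2):
    """Toy item-based collaborative filtering via co-occurrence.
    history:    list of per-user item sets seen in the past
    user_items: items the target user already interacted with
    """
    scores = Counter()
    for basket in history:
        if basket & user_items:              # shares an interest
            for item in basket - user_items:
                scores[item] += 1            # co-occurring item votes
    return [item for item, _ in scores.most_common(top_n)]

past = [{"hadoop", "hive"}, {"hadoop", "pig"},
        {"hive", "pig"}, {"flume"}]
suggested = recommend(past, {"hadoop"})
# suggests "hive" and "pig": both co-occur with "hadoop" in history
```

Mahout's contribution is running this kind of counting and similarity math as distributed MapReduce jobs over data far too large for one machine.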
Distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
Not restricted to log data aggregation. It can also be used to transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages, and pretty much any possible data source.
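A Flume agent is wired together with a properties file naming its sources, channels, and sinks. The sketch below tails a log file into HDFS; the agent name, file path, and HDFS path are placeholders for illustration.

```properties
# Hypothetical single-agent config: tail a log file into HDFS.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```

Swapping the source type (e.g. to a network or syslog source) is how the same pipeline carries the other event data mentioned above.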
The Existing Splitting Criteria
The Onion Analogy
We analyse the splitting techniques currently used in Apache Hadoop, devise our own new splitting techniques, and compare them with each other and with the existing ones on Association Rule Mining (ARM) algorithms, measuring efficiency across various types and sizes of data.