Introduction to BIG data analytics with Hadoop

Presented as a regular talk at Google Developer Group
by

Gautam Anand

on 13 May 2013

Transcript of Introduction to BIG data analytics with Hadoop

Part 1: Understanding Big Data Analytics

"Every day, we create 2.5 quintillion bytes of data (about 2.2 exabytes) — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data." - IBM

The 4 dimensions of big data:

1. Volume - Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.
2. Velocity - For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
3. Variety - Big data is any type of data: structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
4. Veracity - 1 in 3 business leaders don't trust the information they use to make decisions. How can you act upon information if you don't trust it? Establishing trust in big data presents a huge challenge as the variety and number of sources grows.

Conclusion? Big data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach.

Points to remember:
Everyone is interested in collecting data.
Real-time analytics over this large data is the defining characteristic of big data.
Handling structured, semi-structured and unstructured data.
Effective visualization of the extracted nuggets.

How is this different from large data analysis, with which it is often confused? In large data analysis, nobody cares about real-time analytics: it is usually done by experts at their own convenience. This is "unproductive", because the rate of inflow of data is huge and the analysis results are delivered very late; in some cases the analysis is only available after the event it concerns has already closed.
Opportunity: organizations in every sector are either collecting data or planning to collect it, but they have no "automated software solution" to help them analyze it and view results in real time; otherwise they would simply want to hire a "data scientist".

Present solutions: Hortonworks Data Platform

MapReduce & HDFS: distributed components for processing & analysis [BI / data mining]
Pig & Hive: DB query (a minimal query sketch follows below)
Integration services: port data from external sources via APIs
HBase: NoSQL store
Oozie, Ambari: workflow scheduling and cluster management
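To make the "DB query" layer concrete, here is a minimal sketch of running a Hive query from Python. It is an illustration only: it assumes the third-party PyHive client and a HiveServer2 endpoint on localhost:10000, and the web_logs table is a hypothetical example, not something from this talk.

# Minimal sketch: querying Hive from Python via PyHive (assumed installed),
# against an assumed HiveServer2 on localhost:10000. "web_logs" is hypothetical.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# HiveQL looks like SQL, but Hive compiles the query into MapReduce jobs
# that run over data stored in HDFS.
cursor.execute(
    "SELECT page, COUNT(*) AS hits "
    "FROM web_logs GROUP BY page "
    "ORDER BY hits DESC LIMIT 10"
)
for page, hits in cursor.fetchall():
    print(page, hits)

cursor.close()
conn.close()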

"Impressive" Scope There are multiple uses for big data in every industry – from analyzing large volumes of data than was previously possible to drive more precise answers, to analyzing data in motion to capture opportunities that were previously lost. A big data platform will enable your organization to tackle complex problems that previously could not be solved. Part2 :Quick Technology to get you started ? Distributed Computing : MapReduce & HDFS Lets see an example : BigData in Energy & Utility industry What is Distributed computing ? Mapreduce How is it a saviour in BigData approach ? Distributed computing is a field of computer science that studies distributed systems.

A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers.
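As a toy illustration of that idea (pure Python with the standard multiprocessing module, not Hadoop itself; the sample sentences are made up), a problem can be divided into chunks that separate worker processes solve independently, with the partial results passed back to be combined:

# Toy sketch: divide a problem into tasks solved by separate worker processes.
# Each worker has its own private memory; inputs and partial results travel
# between the parent and the workers as messages.
from multiprocessing import Pool

def count_words(chunk):
    # Task executed by one worker: count the words in its own chunk.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["big data is big", "hadoop processes data", "data data data"]
    chunks = [lines[0:1], lines[1:2], lines[2:3]]   # one task per worker

    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)   # run tasks in parallel

    print(sum(partial_counts))   # combine the partial results -> 10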

In parallel computing, all processors may have access to a shared memory to exchange information between processors.
In distributed computing, each processor has its own private memory (distributed memory); information is exchanged by passing messages between the processors.

Hadoop Distributed File System (HDFS)

HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework.

A Hadoop cluster nominally has a single namenode plus a cluster of datanodes, which together form the HDFS cluster; not every node needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS.

The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate with the namenode and datanodes.

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

Hardware failure: some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming data access: applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use by users.

Large data sets: applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to thousands of nodes in a single cluster. It should support tens of millions of files in a single instance.


Appending writes and file syncs: most HDFS applications need a write-once-read-many access model for files. HDFS provides two additional advanced features: hflush and append. hflush makes the last block of an unclosed file visible to readers while providing read consistency and data durability. Append provides a mechanism for reopening a closed file to add more data.

How HDFS stores a file: HDFS takes a file and breaks it into equal-sized blocks; the default block size is 64 MB, so an ideal file size is a multiple of 64 MB. The namenode records the metadata for each block (unique per block), while the block contents live on datanodes; every block is then replicated and saved on several machines across the server farm.
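As a rough sketch of that split-and-replicate idea, here is a pure-Python simulation for intuition only (real HDFS does this inside the Java namenode/datanode services); the 64 MB block size and replication factor of 3 are the classic defaults, and the datanode names are made up:

# Simulate splitting a file into fixed-size blocks and replicating each
# block across datanodes, the way HDFS does conceptually.
BLOCK_SIZE = 64 * 1024 * 1024      # classic HDFS default block size: 64 MB
REPLICATION = 3                    # classic default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(file_size_bytes):
    # Return (block_id, block_length) pairs covering the whole file.
    blocks, offset, block_id = [], 0, 0
    while offset < file_size_bytes:
        length = min(BLOCK_SIZE, file_size_bytes - offset)
        blocks.append((block_id, length))
        offset += length
        block_id += 1
    return blocks

def place_replicas(blocks):
    # Namenode-style metadata: block id -> datanodes holding a copy.
    # Round-robin placement; real HDFS also considers racks and free space.
    return {block_id: [DATANODES[(block_id + r) % len(DATANODES)]
                       for r in range(REPLICATION)]
            for block_id, _ in blocks}

if __name__ == "__main__":
    blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
    print(blocks)                # three full 64 MB blocks plus one 8 MB block
    print(place_replicas(blocks))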

"Application never reads via namenode" .Namenode just updates the system.App reads via datanode .This is prevent loss of mapping info when namenode fails.

In enterprise Hadoop deployments there is also a "backup namenode" for exactly this reason.

What happens when a datanode fails? The namenode detects the failure, works out which blocks were damaged, stores a new copy of them at a new location, and updates its metadata so that the new datanode replaces the damaged one.

MapReduce has two components:

Mapper phase - all the datanodes perform the same computation at the same time, but each one runs it on its own block of data within the cluster.

If the blocks are not all of the same size, the duration of the mapper phase is dictated by the datanode holding the biggest block.

The records are handled as <key,value> pairs: the key identifies the thing the requested function is computed over, and the value is the result of that function for the key.


Reduce phase - once the mapper phase has executed the function, this phase reduces (combines) the <key,value> pairs from the various datanodes into one compiled result.

For example, the reduce phase may take three pairs: <key1,value1>, <key2,value2>, <key3,value3>.

The two phases are executed sequentially: first map, then reduce.
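Here is a minimal in-memory sketch of those two phases in plain Python (no Hadoop involved); the classic "maximum temperature per city" task and the sample readings are made-up illustrations:

# Minimal sketch of the map and reduce phases, run sequentially in memory.
from collections import defaultdict

def mapper(record):
    # Mapper phase: emit one <key, value> pair per input record.
    city, reading = record.split(",")
    yield (city, int(reading))

def reducer(key, values):
    # Reduce phase: combine every value that shares the same key.
    return (key, max(values))

# Each "block" of records would sit on a different datanode in Hadoop.
blocks = [["delhi,41", "mumbai,33"], ["delhi,39", "mumbai,35"]]

pairs = [pair for block in blocks for record in block for pair in mapper(record)]

grouped = defaultdict(list)          # shuffle/sort: group the values by key
for key, value in pairs:
    grouped[key].append(value)

print([reducer(k, v) for k, v in sorted(grouped.items())])
# -> [('delhi', 41), ('mumbai', 35)]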


Example: find how often the words "green" and "orange" are repeated in a pool of data. Suppose we divide the data into 3 blocks (m = 3), each of which is further subdivided internally, and every block is held by 3 datanodes. We will not go into the detail of the per-sub-block results at the moment.

Each datanode will return: <green,1> ; <green,2> ; <green,3>
<orange,2> ; <orange,1> ; <orange,3>

In the combine phase the pairs are sorted by key; after that, the reduce phase joins them. The results look like this:

<green,1+2+3>
<orange,2+1+3>

or

<green,6>
<orange,6>
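The same green/orange count can be run on a real cluster with Hadoop Streaming, which lets the mapper and reducer be ordinary scripts that read stdin and write stdout. A minimal sketch follows; the script names, HDFS paths and the exact location of the streaming jar are illustrative and vary by installation.

#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper: read lines from stdin and emit one
# tab-separated <key, value> pair per matching word ("green<TAB>1", ...).
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        if word in ("green", "orange"):
            print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- Hadoop Streaming reducer: the input arrives sorted by key,
# so counts for the same word are adjacent and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

# Submitting the job (jar path and HDFS paths are illustrative):
# hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar \
#     -input /data/pool -output /data/wordcount \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py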

This speeds up the analysis process and makes it possible to match real-time results with the various visualization standards. So we can see improved real-time analytics with Apache Hadoop.