MapReduce by Kailash

An attempt to explain MapReduce
by Kailashnath Kutti
on 12 March 2013


Transcript of MapReduce by Kailash

[(cc) image by anemoneprojectors on Flickr]

The task: find out how many times each word is repeated in every file.

File 1:
EGIT is an IT arm of Emirates Airline
Emirates is a fast growing Airline
EGIT is learning Hadoop
Hadoop is a MapReduce framework

Result: a naive, single-machine program that scans the files sequentially, roughly:

public class CountWords {
    static void mainProgram(List<File> inputFiles, String searchTerm) {
        List<String> results = new ArrayList<>();
        for (File f : inputFiles) {
            if (contains(f, searchTerm)) {   // contains() left as pseudocode, as on the slide
                results.add(f.getName());
            }
        }
        System.out.println("Files: " + results);
    }
}

What if you have to search all EG-IT SharePoint sites and prepare a report on every abusive word used?

A Hadoop-style map function:

public void map() {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    if (itr.countTokens() >= N) {
        while (itr.hasMoreTokens()) {
            word = itr.nextToken() + "|" + key.getFileName();
            output.collect(word, 1);
        }
    }
}
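The map() above relies on Hadoop-provided objects (key, value, output), so it will not compile on its own. As a plain-Java sketch of the same idea (the class and method names here are illustrative, not Hadoop's API, and the filename suffix is omitted for simplicity), mapping one line to <word,1> pairs might look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapSketch {
    // Emit a <token,1> pair for every word in the line,
    // mirroring the body of the map() on the slide.
    static List<String> map(String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add("<" + itr.nextToken() + ",1>");
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Line 3 of the sample file
        System.out.println(map("EGIT is learning Hadoop"));
        // prints [<EGIT,1>, <is,1>, <learning,1>, <Hadoop,1>]
    }
}
```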
Master:
1. Load file 1
2. Read all lines from the file
3. Create chunks
4. Send chunk 1 to Node A with map code
5. Send chunk 2 to Node B with map code
6. Send chunk 3 to Node C with map code
7. Send chunk 4 to Node D with map code

Each node runs the map function and returns its results:
Node 1: do map function, return
Node 2: do map function, return
Node 3: do map function, return
Node 4: do map function, return

Map phase output, one group per line of File 1:

Node 1 (line 1):
<EGIT,1>
<is,1>
<an,1>
<IT,1>
<arm,1>
<of,1>
<Emirates,1>
<Airline,1>

Node 2 (line 2):
<Emirates,1>
<is,1>
<a,1>
<fast,1>
<growing,1>
<Airline,1>

Node 3 (line 3):
<EGIT,1>
<is,1>
<learning,1>
<Hadoop,1>

Node 4 (line 4):
<Hadoop,1>
<is,1>
<a,1>
<MapReduce,1>
<framework,1>

The map step on each node:
- Read a line of text
- Tokenize the line of text
- Create a new key/value pair with the token as the key and 1 as the value

The intermediate results go to HDFS, and reduce then runs on every node (Reduce on Node 1, Node 2, Node 3, Node 4). Run reduce on every chunk, store the result, and calculate the word count:

public void reduce() {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    output.collect(key, sum);
}

End of the reduce phase.
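Like map(), the reduce() above depends on Hadoop's iterator and collector objects. A minimal plain-Java equivalent (illustrative names, not Hadoop's API) that sums the 1s gathered under each key:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceSketch {
    // For every key, sum its list of values -- the body of reduce() above.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // The four <is,1> pairs and two <EGIT,1> pairs from the map phase
        // end up grouped under their keys.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("is", List.of(1, 1, 1, 1));
        grouped.put("EGIT", List.of(1, 1));
        System.out.println(reduce(grouped)); // prints {is=4, EGIT=2}
    }
}
```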
Recap of the full flow. File 1:
EGIT is an IT arm of Emirates Airline
Emirates is a fast growing Airline
EGIT is learning Hadoop
Hadoop is a MapReduce framework

HDFS sends every line to the mappers:
Node 1: map line 1
Node 2: map line 2
Node 3: map line 3
Node 4: map line 4

HDFS collects the intermediate results, then:
Node 1: reduce based on the given key "EGIT"
Node 2: reduce based on the given key "Emirates"
And so on...

Finally, HDFS collects the reducer results from the nodes and writes them to the output. Khalas! (Done!)

When you code MapReduce in Hadoop, you may not find all these steps exactly as described here.

MapReduce Fundas:
- A functional programming paradigm
- The whole program is divided into Map and Reduce
- Requires scalable storage
- The best-known implementation in Java is Apache Hadoop
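The Map/Reduce split can be demonstrated end-to-end in a single JVM. This is a toy simulation of the technique, not Hadoop itself: map each line to <word,1> pairs, group by key (the shuffle), then reduce by summing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountPipeline {
    static Map<String, Integer> run(List<String> lines) {
        // Map + shuffle: emit a 1 per word and group the 1s by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> file1 = List.of(
            "EGIT is an IT arm of Emirates Airline",
            "Emirates is a fast growing Airline",
            "EGIT is learning Hadoop",
            "Hadoop is a MapReduce framework");
        System.out.println(run(file1).get("is"));   // prints 4
        System.out.println(run(file1).get("EGIT")); // prints 2
    }
}
```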
Shuffle/sort:
- The intermediate items are sorted
- All values for a particular key are sent to the same node for reducing
(For example, the four <is,1> pairs produced by the four mappers all land on one reducer, which sums them to <is,4>.)

Reduce (the business logic): read each key/value group and, for every key, sum the values.

MPP: a technique for large-scale data processing that keeps the business logic along with the data:
1. Split the data into smaller units
2. Send them to independent machines
3. Send business logic along with the data
4. Run the jobs simultaneously
5. Collect and collate the results
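Steps 1-5 above can be sketched with plain Java threads standing in for independent machines (a toy illustration; the class and method names here are made up, not part of any library):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MppSketch {
    // Count words across chunks in parallel, one worker per chunk.
    static Map<String, Integer> count(List<String> chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(chunks.size());
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();
        for (String chunk : chunks) {
            // Steps 2-4: send each chunk to an independent worker together
            // with the business logic, and run the jobs simultaneously.
            Callable<Map<String, Integer>> task = () -> {
                Map<String, Integer> local = new HashMap<>();
                for (String w : chunk.split("\\s+")) local.merge(w, 1, Integer::sum);
                return local;
            };
            futures.add(pool.submit(task));
        }
        // Step 5: collect and collate the partial results.
        Map<String, Integer> total = new TreeMap<>();
        for (Future<Map<String, Integer>> f : futures) {
            f.get().forEach((w, c) -> total.merge(w, c, Integer::sum));
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        // Step 1: split the data into smaller units -- one line of File 1 per chunk.
        List<String> chunks = List.of(
            "EGIT is an IT arm of Emirates Airline",
            "Emirates is a fast growing Airline",
            "EGIT is learning Hadoop",
            "Hadoop is a MapReduce framework");
        System.out.println(count(chunks).get("is")); // prints 4
    }
}
```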