Apache Hadoop

Yuriy Senko

on 20 September 2013

Transcript of Apache Hadoop

Apache Hadoop
MapReduce
Yuriy Senko

Hadoop is about scalability, not performance.

What is Hadoop?

Elephant
Framework that supports data-intensive distributed applications.

MapReduce

Simple data-parallel programming model designed for scalability and fault tolerance.
Pioneered by Google (processes ~20 petabytes/day).
Popularized by the open-source Hadoop project (currently used by Yahoo, Facebook, Amazon...).

MapReduce: common design patterns in data processing

  cat * | grep | uniq | sort | cat > file
  input | map | shuffle | reduce | output

Natural for:

  • Log processing
  • Web search indexing
  • Ad-hoc queries

Goals of HDFS

Very large distributed filesystem

  • An ability to store files of any size (on average 10K nodes, 100 million files, 10 PB of data).

Failure handling

  • Files are replicated to handle hardware failures.
  • Detects failures and recovers from them.

Optimized for batch processing

  • Move computation to data.
  • Direct read from slave nodes.
  • Provides very high aggregate bandwidth.

Hadoop Distributed File System

  • Each file is split into blocks.
  • Blocks are replicated across several DataNodes (usually 3).
  • Files in HDFS are "write once" (no random writes to files are allowed).
  • A single NameNode stores metadata (filename, block location, etc.).
  • Each DataNode sends a Heartbeat message to the NameNode periodically.
  • Optimized for large files and sequential reads.

How it works?
Examples
Q&A

Hadoop scales, and it scales linearly. It scales in storage capacity and in compute capacity.

Along with scaling, it is fault tolerant in the face of hardware failures.

One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, beating the previous record of 297 seconds in the annual general-purpose (Daytona) terabyte sort benchmark. The sort benchmark, created in 1998 by Jim Gray, specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open-source program has won.

Typical cluster configuration

  • 910 nodes
  • 2 quad-core Xeons @ 2.0 GHz per node
  • 4 SATA disks per node
  • 8 GB RAM per node
  • 1 gigabit Ethernet on each node
  • 40 nodes per rack
  • 8 gigabit Ethernet uplinks from each rack to the core
  • Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
  • Sun Java JDK 1.6.0_05-b13

Hadoop parts

  • HDFS
  • MapReduce

Wordcount map (Hadoop streaming)

    #!/usr/bin/env python

    import sys
    import re

    # A word is a run of ASCII letters delimited by word boundaries.
    WORD_PATTERN = re.compile(r'\b[A-Za-z]+\b')


    def _get_words(from_=sys.stdin):
        """Read `from_` line by line and yield a list of words which contain
        alphabetical characters only.
        """
        for line in from_:
            yield WORD_PATTERN.findall(line)


    def _print_results(words):
        """Print a list of words in the following format:

        `word_1`\t1\n
        ...
        `word_n`\t1\n
        """
        for word in words:
            # Parenthesized single-argument print works on Python 2 and 3.
            print('%s\t1' % (word.lower(),))


    def main():
        """Read lines from sys.stdin and process them.
        """
        for words in _get_words():
            _print_results(words)


    if __name__ == '__main__':
        main()

Wordcount reduce (Hadoop streaming)

    #!/usr/bin/env python

    import sys


    def main():
        """Read the data from sys.stdin and process it.

        The input must be sorted by word; Hadoop sorts the map output
        by key before it reaches the reducer.
        """
        current_word = None
        current_count = 0

        for line in sys.stdin:
            try:
                # Skip lines that are not in the `word\tcount` format.
                word, count = line.rstrip('\n').split('\t', 1)
                count = int(count)
            except ValueError:
                continue

            if current_word == word:
                current_count += count
            else:
                # A new word starts: emit the total for the previous one.
                if current_word is not None:
                    print('%s\t%s' % (current_word, current_count))

                current_word = word
                current_count = count

        # Emit the total for the last word.
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))


    if __name__ == '__main__':
        main()
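
The two streaming scripts can be hard to picture as one flow, so here is a minimal single-process sketch of the same input | map | shuffle | reduce | output pipeline. It is only an illustration, not how Hadoop runs jobs: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example, and an in-memory `sorted` call stands in for Hadoop's distributed shuffle.

```python
import re
from itertools import groupby

WORD_PATTERN = re.compile(r'\b[A-Za-z]+\b')


def map_phase(lines):
    """Emit a (word, 1) pair for every alphabetic word, like the mapper script."""
    for line in lines:
        for word in WORD_PATTERN.findall(line):
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Sort pairs by key so equal words become adjacent.

    Assumption: this in-memory sort is only a stand-in for Hadoop's shuffle.
    """
    return sorted(pairs, key=lambda pair: pair[0])


def reduce_phase(sorted_pairs):
    """Sum the counts of each run of identical words, like the reducer script."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run input | map | shuffle | reduce | output in a single process."""
    return list(reduce_phase(shuffle_phase(map_phase(lines))))


if __name__ == '__main__':
    sample = ['Hadoop scales', 'and Hadoop is fault tolerant']
    for word, total in word_count(sample):
        print('%s\t%s' % (word, total))
```

The reducer logic depends on the sort step: `groupby` only merges adjacent equal keys, which is the same reason the streaming reducer can get away with tracking a single `current_word`.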
MapReduce is fault tolerant

  • If a node fails, the master detects the failure and re-assigns the work to a different node in the system.
  • Restarting a task does not require communication with nodes working on other portions of the data.
  • If a failed node restarts, it is automatically added back to the system and assigned new tasks.
  • If a node appears to be running slowly, the master can redundantly execute another instance of the same task. Results from the first to finish will be used.

So what about Python?

Why Pydoop?

  • No need to use Hadoop Streaming or Jython.
  • Easy Hadoop scripting plus HDFS API.

http://pydoop.sourceforge.net/

How to try Hadoop?

1. Download a ready-made VirtualBox/VMware/etc. image from Cloudera.
2. That's all.

HowManyMapsAndReduces

Maps

The number of maps is usually driven by the number of DFS blocks in the input files.

Actually controlling the number of maps is subtle. The default behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits.

Reduces

The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing. (Hadoop wiki)

hg clone https://ChakaBum@bitbucket.org/ChakaBum/hadoop_map_red
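
The rules of thumb for maps and reduces quoted above from the Hadoop wiki can be turned into a quick back-of-the-envelope estimate. The sketch below is only that: the helper names (`estimate_maps`, `suggest_reduces`), the 64 MiB block size, and the 10-node cluster are assumptions made for illustration, and rounding the reduce count down is a choice of this example rather than something the wiki specifies. The factor is passed as an integer percentage (95 or 175) to keep the arithmetic exact.

```python
def estimate_maps(file_sizes, block_bytes):
    """Approximate the number of map tasks as one per DFS block of each input file.

    Assumption: default input splitting, where a split never exceeds one block.
    """
    # -(-a // b) is integer ceiling division.
    return sum(-(-size // block_bytes) for size in file_sizes)


def suggest_reduces(nodes, max_tasks_per_node, factor_percent=95):
    """Apply the rule of thumb: 0.95 or 1.75 * (nodes * max tasks per node).

    `max_tasks_per_node` plays the role of mapred.tasktracker.tasks.maximum.
    The result is rounded down (an assumption of this sketch).
    """
    if factor_percent not in (95, 175):
        raise ValueError('the rule of thumb uses a factor of 0.95 or 1.75')
    return (factor_percent * nodes * max_tasks_per_node) // 100


if __name__ == '__main__':
    mib = 1024 ** 2
    # Hypothetical job: one 1 GiB file and one 100 MiB file, 64 MiB blocks.
    print(estimate_maps([1024 * mib, 100 * mib], 64 * mib))
    # Hypothetical cluster: 10 nodes with 2 task slots each.
    print(suggest_reduces(10, 2))                      # single wave of reduces
    print(suggest_reduces(10, 2, factor_percent=175))  # roughly two waves
```

With the 0.95 factor every reduce fits into a free slot at once; with 1.75 there are more reduces than slots, so faster nodes pick up a second round.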

https://bitbucket.org/ChakaBum/hadoop_map_red

What is MapReduce used for?

  • More than 100,000 CPUs in >40,000 computers running Hadoop.
  • Our biggest cluster: 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM).
  • Used to support research for Ad Systems and Web Search.
  • Also used to do scaling tests to support development of Hadoop on larger clusters.

What is MapReduce used for?

  • A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
  • A 300-machine cluster with 2400 cores and about 3 PB raw storage.
  • We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.

What is MapReduce used for?

We use Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We use Cloudera's CDH2 distribution of Hadoop, and store all data as compressed LZO files.