Jeffrey Denton, 31 March 2014

Transcript of Using PBS and myHadoop to Schedule Hadoop MapReduce Jobs Accessing a Persistent OrangeFS Installation

Using PBS and myHadoop to Schedule Hadoop MapReduce Jobs Accessing a Persistent OrangeFS Installation
Goal: enable researchers at Clemson to run Hadoop MapReduce jobs in a PBS-scheduled environment, leveraging an existing OrangeFS installation.
What is Apache Hadoop?
What is OrangeFS?
What is myHadoop?

"The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing."

What is Apache Hadoop?
OrangeFS is the next generation of PVFS, an open-source, distributed, parallel file system.
What is OrangeFS?
"myHadoop is a simple system for end-users to provision Hadoop instances on traditional supercomputing resources, without requiring any root privileges. Users may use myHadoop to configure and instantiate Hadoop on the fly via regular batch scripts."

What is myHadoop?
OrangeFS with Hadoop MapReduce
Hadoop's abstract FileSystem class: a configuration file sets the designated file system (see the configuration sketch after this section)
Hadoop is written in Java; OrangeFS is written in C

Bridging the gap
OrangeFS JNI Shim
A Java Native Interface (JNI) shim allows data to be passed between the two
Allows Java methods to execute C functions present in the OrangeFS Direct Client Interface
Avoids memory copies between Java and C by using Java's NIO Direct ByteBuffer

OrangeFS Direct Client Interface Library
A collection of familiar POSIX-like and standard input/output (stdio.h) library calls designed for parallel access to OrangeFS
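A minimal sketch of how a job's core-site.xml could designate OrangeFS as the file system. Hadoop resolves "fs.SCHEME.impl" to the FileSystem subclass that services a URI scheme; the ofs:// scheme, host, port, and class name below are illustrative assumptions, not values taken from the presentation.

cat > $CONFIG_DIR/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Default file system: an OrangeFS URI instead of hdfs://
       (host, port, and scheme here are illustrative) -->
  <property>
    <name>fs.default.name</name>
    <value>ofs://node1234-ib0.lakeside.clemson.edu:3334/</value>
  </property>
  <!-- Map the ofs:// scheme to the JNI shim's Java-side FileSystem
       subclass; this class name is an assumption -->
  <property>
    <name>fs.ofs.impl</name>
    <value>org.orangefs.hadoop.fs.ofs.OrangeFileSystem</value>
  </property>
</configuration>
EOF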
Source: MapReduce diagram, taken from "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat
Hadoop Distributed File System (HDFS)
HDFS Architecture
Source: http://hadoop.apache.org/docs/stable/images/hdfsarchitecture.gif
Source: http://blog.raremile.com/hadoop-demystified

Comparison between HDFS and OrangeFS
Client Perspective
Two Different Storage Approaches
Dedicated (remote) storage

OrangeFS Distinguishing Features:
Based on PVFS
Multiple metadata servers
Supports full R/W anywhere within a file
Written in C
Replication coming in v3.0
Open-source (LGPL)
Paid support offered through Omnibond
Active community and developers list
Diverse selection of client interfaces
Supported OS: Mac OS X

OrangeFS Client Interfaces
Direct Interface
Other interfaces:
myHadoop - How does it work?
User requests n nodes via PBS
$PBS_NODEFILE: a file containing the list of compute nodes assigned by PBS for the user's requested job
The first node is assigned as the Hadoop JobTracker and NameNode; it also functions as a TaskTracker and DataNode
The remaining nodes are assigned as TaskTrackers and DataNodes only

myHadoop - How does it work?
bin/pbs-configure.sh:
Copies the default configuration from the Hadoop conf directory to the desired job configuration directory
Uses 'sed' (stream editor) to replace the template placeholder "MASTER" with the master hostname (e.g., "node1234") for both the JobTracker and NameNode
Copies $PBS_NODEFILE to $CONFIG_DIR/slaves
Appends desired environment variables to $CONFIG_DIR/hadoop-env.sh (a sketch of these steps follows)
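A minimal sketch of those four steps, assuming bin/setenv.sh has already defined HADOOP_HOME and CONFIG_DIR; everything beyond the names shown above is illustrative rather than myHadoop's exact code.

# First entry in $PBS_NODEFILE becomes the Hadoop master (JobTracker + NameNode)
MASTER_NODE=$(head -n 1 "$PBS_NODEFILE")

# 1. Start each job from Hadoop's stock configuration
cp -r "$HADOOP_HOME"/conf/* "$CONFIG_DIR"/

# 2. Point the JobTracker and NameNode templates at the master node
sed -i "s/MASTER/$MASTER_NODE/g" "$CONFIG_DIR"/core-site.xml
sed -i "s/MASTER/$MASTER_NODE/g" "$CONFIG_DIR"/mapred-site.xml

# 3. Every allocated node becomes a Hadoop slave
cp "$PBS_NODEFILE" "$CONFIG_DIR"/slaves

# 4. Job-specific environment for the Hadoop daemons
echo "export JAVA_HOME=$JAVA_HOME" >> "$CONFIG_DIR"/hadoop-env.sh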

myHadoop - How does it work?
bin/pbs-configure.sh (continued):
For all Hadoop slaves (see the sketch after this list):
ssh to each slave
Remove any previous log/data directories
Re-create them for the current run
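A sketch of that per-slave loop, assuming the log and data locations are carried in HADOOP_LOG_DIR and HADOOP_TMP_DIR (illustrative variable names).

# $PBS_NODEFILE lists a node once per processor, so deduplicate first
for node in $(sort -u "$PBS_NODEFILE"); do
  # Wipe state from any previous run, then recreate empty directories
  ssh "$node" "rm -rf $HADOOP_LOG_DIR $HADOOP_TMP_DIR; mkdir -p $HADOOP_LOG_DIR $HADOOP_TMP_DIR"
done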
myHadoop - How does it work?
Template configuration files:
etc/core-site.xml - template file, populated with the first hostname in $PBS_NODEFILE
etc/mapred-site.xml - template file, populated with the first hostname in $PBS_NODEFILE (an illustrative placeholder line follows)

etc/hdfs-site.xml - file filled with reasonable HDFS options
Not really a template, since it isn't based on allocated resources
myHadoop overwrites the default hdfs-site.xml with this customized one
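For illustration, the placeholder line inside the core-site.xml template might look like this before pbs-configure.sh rewrites it; the hdfs:// port and property value are assumptions.

grep -A 1 'fs.default.name' etc/core-site.xml
#   <name>fs.default.name</name>
#   <value>hdfs://MASTER:54310</value>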
myHadoop - How does it work?
Scripts:
bin/setenv.sh - defines required environment variables
bin/pbs-configure.sh - prepares compute nodes and configuration files for Hadoop MR job(s)
bin/pbs-cleanup.sh - cleans up after the MR job completes
pbs-example.sh - an example PBS-scheduled MR job using the myHadoop scripts and template configuration files (a condensed sketch follows)
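A condensed sketch of what a pbs-example.sh-style job script ties together, assuming MY_HADOOP_HOME points at the myHadoop install; the flags, resource requests, and paths are illustrative, not the project's exact script.

#!/bin/bash
#PBS -N myhadoop-wordcount
#PBS -l nodes=4:ppn=8,walltime=01:00:00

# Define HADOOP_HOME, CONFIG_DIR, etc.
. "$MY_HADOOP_HOME"/bin/setenv.sh

# Prepare compute nodes and the per-job configuration
"$MY_HADOOP_HOME"/bin/pbs-configure.sh -n 4 -c "$CONFIG_DIR"

# Start the daemons, run a job, then shut everything down
export HADOOP_CONF_DIR=$CONFIG_DIR
"$HADOOP_HOME"/bin/start-all.sh
"$HADOOP_HOME"/bin/hadoop --config "$CONFIG_DIR" jar \
    "$HADOOP_HOME"/hadoop-examples-*.jar wordcount input output
"$HADOOP_HOME"/bin/stop-all.sh

# Clean up node-local state
"$MY_HADOOP_HOME"/bin/pbs-cleanup.sh -n 4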
myHadoop Customization for OrangeFS
For all Hadoop slaves:
ssh to each slave
No need to handle HDFS data directories, but must handle $HADOOP_TMP_DIR

myHadoop Customization for OrangeFS
Removed myHadoop's persist functionality, since OrangeFS data always persists
Removed the dynamic setup for HDFS, since the dedicated OrangeFS installation is static
sed is used to force IPoIB connectivity for MapReduce:
sed -i 's/.lakeside/-ib0.lakeside/g' $CONFIG_DIR/slaves
node0001.lakeside.clemson.edu -> node0001-ib0.lakeside.clemson.edu
Appended OrangeFS-related environment variables to hadoop-env.sh:
echo "export HADOOP_CLASSPATH=$JNI_LIBRARY_PATH/ofs_hadoop.jar:$JNI_LIBRARY_PATH/ofs_jni.jar" >> $CONFIG_DIR/hadoop-env.sh
These variables cover the path to the OrangeFS shared libraries which Hadoop MapReduce will utilize, and specify which OrangeFS server the client should contact initially (see the sketch below)
myHadoop Customization for OrangeFS
Convenience Scripts:
One starts all MapReduce and HDFS daemons
Another starts only the MapReduce daemons (all that is needed when OrangeFS replaces HDFS)
There are matching "stop" scripts too (usage sketch below)
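A sketch of the intended usage, assuming Hadoop 1.x-style start/stop scripts; with OrangeFS serving as the file system, only the MapReduce daemons are started.

export HADOOP_CONF_DIR=$CONFIG_DIR

"$HADOOP_HOME"/bin/start-mapred.sh   # JobTracker + TaskTrackers only; no HDFS daemons
# ... run MapReduce job(s) here ...
"$HADOOP_HOME"/bin/stop-mapred.sh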
myHadoop Customization for OrangeFS
GitHub Repository:
git clone https://github.com/nevdullcode/myHadoop-orangefs
Create the config output directory
Run Hadoop MapReduce job(s) (see the quickstart sketch below)
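Putting it together, a quickstart might look like the following; the configuration directory name is illustrative.

git clone https://github.com/nevdullcode/myHadoop-orangefs
cd myHadoop-orangefs

# Config output directory consumed by pbs-configure.sh
mkdir -p "$HOME/myhadoop-config"

# Submit the example MapReduce job through PBS
qsub pbs-example.sh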
Co-located processing and storage: OrangeFS performs comparably to HDFS using the traditional architecture
Dedicated Test (increasing Hadoop MR clients): dedicated (remote) storage shows a ~25% improvement over HDFS
This product includes software developed by the San Diego Supercomputer Center.