Using PBS and myHadoop to Schedule Hadoop MapReduce Jobs Accessing a Persistent OrangeFS Installation
Jeffrey Denton, 31 March 2014
Motivation
Enable researchers at Clemson to run Hadoop MapReduce jobs in a PBS-scheduled environment while leveraging an existing OrangeFS installation.
Background
What is Apache Hadoop?
What is OrangeFS?
What is myHadoop?

"The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing."

hadoop.apache.org
What is Apache Hadoop?
OrangeFS is the next generation of PVFS, an open-source, distributed, parallel file system.
http://www.orangefs.org
What is OrangeFS?
"myHadoop is a simple system for end-users to provision Hadoop instances on traditional supercomputing resources, without requiring any root privileges. Users may use myHadoop to configure and instantiate Hadoop on the fly via regular batch scripts."

http://sourceforge.net/projects/myhadoop/
What is myHadoop?
OrangeFS with Hadoop MapReduce
Hadoop is written in Java; OrangeFS is written in C
Hadoop provides an abstract FileSystem class, and a configuration file sets the designated file system

Bridging the gap: the OrangeFS JNI Shim
A Java Native Interface (JNI) shim allows data to be passed between the two programs
Allows Java methods to execute C functions present in the OrangeFS Direct Client Interface
Avoids memory copies between Java and C by using Java's NIO direct ByteBuffer

OrangeFS Direct Client Interface Library
A collection of familiar POSIX-like and system standard input/output (stdio.h) library calls designed for parallel access to OrangeFS.
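As a hedged illustration of that configuration hook: Hadoop 1.x selects a FileSystem implementation per URI scheme through the fs.<scheme>.impl property in core-site.xml. The scheme, host, port, and shim class name below are assumptions for illustration, not values taken from this presentation.

cat > $CONFIG_DIR/core-site.xml <<'EOF'
<configuration>
  <property>
    <!-- map the "ofs" URI scheme to the JNI shim's FileSystem subclass (hypothetical class name) -->
    <name>fs.ofs.impl</name>
    <value>org.orangefs.hadoop.OrangeFileSystem</value>
  </property>
  <property>
    <!-- make OrangeFS the default file system (hypothetical host and port) -->
    <name>fs.default.name</name>
    <value>ofs://ofs-server:3334/</value>
  </property>
</configuration>
EOF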
Hadoop
MapReduce
(Diagram: MapReduce, from "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat)
Hadoop Distributed File System (HDFS)
(Diagram: HDFS architecture; sources: http://hadoop.apache.org/docs/stable/images/hdfsarchitecture.gif, http://blog.raremile.com/hadoop-demystified)
Client Perspective
Two Different Storage Approaches
Co-located processing and storage (HDFS)
Dedicated (remote) storage (OrangeFS)

OrangeFS Distinguishing Features
Based on PVFS
Multiple metadata servers
Supports full R/W anywhere within a file
Written in C
Replication coming in v3.0
Open-source (LGPL)
Paid support offered through Omnibond
Active community and developers list
Diverse selection of client interfaces
Supported OS: Linux, Mac OS X, Windows
OrangeFS Client Interfaces
Hadoop
Other interfaces:
Direct Interface
WebDAV
S3
REST
FUSE
MPI-IO

Comparison between HDFS and OrangeFS
myHadoop - How does it work?
Scripts:
bin/setenv.sh - Define required environment variables
bin/pbs-configure.sh - Prepare compute nodes and configuration files for Hadoop MR job(s)
bin/pbs-cleanup.sh - Cleanup after the MR job completes
pbs-example.sh - An example PBS-scheduled MR job using the myHadoop scripts and template configuration files

myHadoop - How does it work?
bin/pbs-configure.sh:
User requests n nodes
PBS creates $PBS_NODEFILE, a file containing the list of compute nodes assigned by PBS for the user's job
First node is assigned as the Hadoop JobTracker and NameNode; it also functions as a TaskTracker and DataNode
Remaining nodes are assigned as TaskTrackers and DataNodes only
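A minimal sketch of this node-assignment step, assuming standard PBS behavior; the variable names are illustrative rather than myHadoop's own:

# $PBS_NODEFILE repeats each hostname once per allocated core, so de-duplicate it
sort -u $PBS_NODEFILE > $CONFIG_DIR/slaves
# the first unique node doubles as JobTracker/NameNode; every listed node runs a TaskTracker/DataNode
MASTER_NODE=$(head -n 1 $CONFIG_DIR/slaves)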

myHadoop - How does it work?
bin/pbs-configure.sh (continued):
Copies the default configuration from the Hadoop conf directory to the desired job configuration directory
Uses 'sed' (stream editor) to replace the template placeholder "MASTER" with the first hostname in $PBS_NODEFILE (e.g., "node1234") for both the JobTracker and NameNode
Copies $PBS_NODEFILE to $CONFIG_DIR/slaves
Appends desired environment variables to $CONFIG_DIR/hadoop-env.sh
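Sketched in shell, under the assumption of stock Hadoop 1.x paths and file names, those four steps might look like this:

# copy the default configuration into the per-job directory
cp $HADOOP_HOME/conf/* $CONFIG_DIR/
# replace the MASTER placeholder with the first allocated hostname
MASTER_NODE=$(head -n 1 $PBS_NODEFILE)
sed -i "s/MASTER/$MASTER_NODE/g" $CONFIG_DIR/core-site.xml $CONFIG_DIR/mapred-site.xml
# record the slave list and extend the per-job environment
cp $PBS_NODEFILE $CONFIG_DIR/slaves
echo "export HADOOP_CONF_DIR=$CONFIG_DIR" >> $CONFIG_DIR/hadoop-env.sh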

myHadoop - How does it work?
bin/pbs-configure.sh (continued):
For all Hadoop slaves:
ssh to each slave
Remove any previous log/data directories
Re-create them for the current run
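The per-slave preparation reduces to a loop of this shape (the directory variables come from setenv.sh; the exact commands are an assumption):

for slave in $(sort -u $CONFIG_DIR/slaves); do
    # wipe state left over from a previous run, then re-create the directories
    ssh $slave "rm -rf $HADOOP_DATA_DIR $HADOOP_LOG_DIR; mkdir -p $HADOOP_DATA_DIR $HADOOP_LOG_DIR"
done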
myHadoop - How does it work?
Template configuration files:
etc/core-site.xml - template file, populated with the first hostname in $PBS_NODEFILE
etc/mapred-site.xml - template file, populated with the first hostname in $PBS_NODEFILE
etc/hdfs-site.xml - file filled with reasonable HDFS options
Not really a template, since it isn't based on allocated resources
myHadoop overwrites the default hdfs-site.xml with this customized one
myHadoop - How does it work?
bin/pbs-cleanup.sh:
For all Hadoop slaves:
ssh to each slave
Remove $HADOOP_DATA_DIR
Remove $HADOOP_LOG_DIR
myHadoop Customization for OrangeFS
bin/setenv.sh
Define $HADOOP_TMP_DIR
No need to handle HDFS data directories, but $HADOOP_TMP_DIR must still be handled
Define $JNI_LIBRARY_PATH
Path to the OrangeFS shared libraries that Hadoop MapReduce will use
Define $PVFS2TAB_FILE
Specifies which OrangeFS server the client should contact initially
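A sketch of the resulting setenv.sh additions; every path is an illustrative assumption, while the pvfs2tab line follows the standard PVFS2/OrangeFS tab-file format:

export HADOOP_TMP_DIR=/tmp/$USER/hadoop      # local scratch space Hadoop still requires
export JNI_LIBRARY_PATH=/opt/orangefs/lib    # holds ofs_hadoop.jar, ofs_jni.jar, and the shared libraries
export PVFS2TAB_FILE=$HOME/pvfs2tab          # names the OrangeFS server the client contacts first
# example pvfs2tab entry (hypothetical server and mount point):
# tcp://ofs-server:3334/orangefs /mnt/orangefs pvfs2 defaults,noauto 0 0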
myHadoop Customization for OrangeFS
bin/pbs-configure.sh
Removed myHadoop's persist functionality, since OrangeFS data always persists
sed is used to force IPoIB connectivity for MapReduce:
sed -i 's/\.lakeside/-ib0.lakeside/g' $CONFIG_DIR/slaves
e.g., node0001.lakeside.clemson.edu -> node0001-ib0.lakeside.clemson.edu
Removed dynamic setup for HDFS, since dedicated OrangeFS is static
Appended OrangeFS-related environment variables to hadoop-env.sh:
echo "export HADOOP_CLASSPATH=$JNI_LIBRARY_PATH/ofs_hadoop.jar:$JNI_LIBRARY_PATH/ofs_jni.jar" >> $CONFIG_DIR/hadoop-env.sh
myHadoop Customization for OrangeFS
bin/pbs-orangefs-example.sh
Convenience scripts:
$HADOOP_PREFIX/bin/start-all.sh - starts all MapReduce and HDFS daemons
$HADOOP_PREFIX/bin/start-mapred.sh - starts only the MapReduce daemons
There are matching "stop" scripts as well
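Since OrangeFS replaces HDFS here, only the MapReduce daemons need to run; usage is a single call against the generated configuration (the --config flag is standard Hadoop 1.x):

# start the JobTracker/TaskTrackers only; no HDFS daemons are needed with OrangeFS
$HADOOP_PREFIX/bin/start-mapred.sh --config $CONFIG_DIR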
myHadoop Customization for OrangeFS
bin/pbs-orangefs-example.sh performs these steps in order (sketched below):
bin/setenv.sh
Create config output directory
bin/pbs-configure.sh
$HADOOP_PREFIX/bin/start-mapred.sh
Run Hadoop MapReduce job(s)
$HADOOP_PREFIX/bin/stop-mapred.sh
bin/pbs-cleanup.sh
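Putting it together, a job script in the spirit of pbs-orangefs-example.sh might look like the following sketch; the node count, paths, $MYHADOOP_HOME, the -n/-c flags, and the wordcount job are illustrative assumptions, not lines from the actual repository:

#!/bin/bash
#PBS -N hadoop-orangefs-example
#PBS -l nodes=4:ppn=8,walltime=01:00:00

# pull in $HADOOP_PREFIX, $JNI_LIBRARY_PATH, $PVFS2TAB_FILE, etc.
. $MYHADOOP_HOME/bin/setenv.sh

# per-job configuration directory
export CONFIG_DIR=$HOME/hadoop-conf.$PBS_JOBID
mkdir -p $CONFIG_DIR

# prepare compute nodes and configuration, then start only the MapReduce daemons
$MYHADOOP_HOME/bin/pbs-configure.sh -n 4 -c $CONFIG_DIR
$HADOOP_PREFIX/bin/start-mapred.sh --config $CONFIG_DIR

# run the MapReduce job(s) against OrangeFS
$HADOOP_PREFIX/bin/hadoop --config $CONFIG_DIR jar \
    $HADOOP_PREFIX/hadoop-examples-*.jar wordcount input/ output/

# tear the daemons down and clean the compute nodes
$HADOOP_PREFIX/bin/stop-mapred.sh --config $CONFIG_DIR
$MYHADOOP_HOME/bin/pbs-cleanup.sh -n 4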
GitHub Repository
git clone https://github.com/nevdullcode/myHadoop-orangefs
Results
Co-located processing and storage: OrangeFS performs comparably to HDFS using the traditional architecture
Dedicated Test (increasing Hadoop MR clients): dedicated (remote) storage shows a ~25% improvement over HDFS

This product includes software developed by the San Diego Supercomputer Center.