Prezi

Present Remotely

Send the link below via email or IM

Copy

Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in the manual

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

DeleteCancel

Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks

Apache Hama Introduction

Slides used to introduce Hama at ApacheCon 2013 Portland Barcamp
by Suraj Menon on 25 February 2013

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of Apache Hama Introduction

Programming in Hama Introduction to BSP Introduced by Leslie Valiant
Introduced message passing and global synchronization to tackle shared memory contention
Hama lets you express solutions using this model on HDFS (Hadoop Distributed File System)
Google Map-Reduce paper mentions BSP model but explains why they didn't pursue BSP
They did introduce Pregel though which is inspired from BSP. Apache Hama Introduction
Programming Model
Architecture
Current Challenges
Future Plans
Working Together Architecture Current Development for Hama 0.7 Future Directions Industrial Strength Fault tolerance.
Hierarchical BSP
Oblivious Synchronization
Integrate with HBase, Accumulo, etc.
Work on Mesos
Experiment new models for computation over Hama
Matrix Multiplications
Excited to see Tez in incubation Working Together We are looking for more contributors!

Useful links:
http://hama.apache.org/
http://wiki.apache.org/hama/GettingStarted
https://issues.apache.org/jira/browse/HAMA
http://hama.apache.org/mail-lists.html
Follow @ApacheHama on Twitter Bulk Synchronous Parallel
solutions over Hadoop Suraj Menon
(PMC, Committer) The existent API (as of 0.6) Groom Server Each unit task executing in parallel by a peer process is called Superstep
In a superstep the peers exchange messages with each other.
Then all the peers enter a synchronization barrier.
After coming out of the barrier the peers work on message sent to them in the previous superstep. When peer enters the barrier
synchronization mode, it is implied that :
Peer has completely executed the superstep.
All the messages for each of the others peers are (most of the times reliably) sent out.
Peer is waiting for all other peers to enter the barrier. When a peer leaves a barrier, it is implied that :
The peer is about to start working on the next superstep.
There are no other peers in the system working on the previous superstep.
A peer gets all the messages, sent to it by other peers in the previous superstep, as input for the new superstep execution. Pregel Workers send messages to each other and to the master node.
Used for graph algorithms where each worker is responsible for the state changes in vertices it holds.
The vertices exchange messages with each other based on the existent adjacent connectivity information.
Hama Graph module, Apache Giraph uses the above model for implementing graph algorithms. Master Task
Depending on the frequency of superstep designed, receives messages from all peers.
Aggregates the data received from all the workers and maintains a global state.
Broadcasts the global state to all the workers. Superstep API Job Submission public class MyClass extends BSP<K1, V1, K2, V2, M extends Writable> {
@Override
public void setup(..){
}

public void bsp(peer){
.......
.......
.......
peer.sync()
.......
.......
peer.sync()


peer.sync()

}

@Override
public void cleanup(..){
}
} while (peer.getCurrentMessage() != null) {
// process message

}

// Compute data
for (x: list of destination peers){
peer.send(peer.getPeerId(x), WritableMessage m));
} protected static class MySuperstep
extends
Superstep<K1, V1, K2, V2, M extends Writable> {

@Override
protected void compute(peer){
// Implementation
// No peer.sync required it is done by the
// framework after the compute function exits.
}
} Submit a series of Superstep classes.
Helps better reusability of code.
Chaining and loop constructs.
Required for Fault tolerance. Superstep API (Proposed) HamaConfiguration conf = new HamaConfiguration();

BSPJob bsp = new BSPJob(conf, MyBSP.class);
// Set the job name
bsp.setJobName("My BSP Example");
bsp.setBspClass(MyBSP.class);
bsp.setInputFormat(SequenceFileInputFormat.class);
bsp.setOutputKeyClass(Text.class);
bsp.setOutputValueClass(DoubleWritable.class);
bsp.setOutputFormat(SequenceFileOutputFormat.class);
job.setInputPath(in);
job.setOutputPath(out);

BSPJobClient jobClient = new BSPJobClient(conf);
bsp.setNumBspTask(10);
bsp.waitForCompletion(true); Job Submission code (similar to Hadoop MapReduce) Responsible for starting and stopping BSP Tasks
Maintaining the task execution status
Reporting the task status to BSP Master YARN adoption should help HAMA leverage the more sophisticated resource manager for scheduling Uses Zookeeper
for
co-ordination BSP Core


Graph Module


YARN Module


ML Module Messaging Scalability
- Spilling Queues
- Sorted Spilling Queue

Partitioning
- Better partitioning scheme can help reducing message exchange during program execution.

Finalizing Superstep API
- The less restrictive model implies the users would be provided with more responsibilities. We have to hit a sweet spot!

Asynchronous Messaging
- Currently messages are sent at the end of the superstep, asynchronous messaging during superstep should give us a more concurrent design. Current Focus YARN Hama has been YARN aware for sometime now
0.7 release is planned to run Hama with YARN scheduler
The implementation is not full-fledged yet.
Tested with few job-submissions
Much more code-refactoring to come. Machine Learning module Today contains:
- K-Means
- Linear, Logistic Regression implemented on BSP
In ApacheCon 2012, Tommaso makes an interesting point on suitability of BSP Model for iterative machine learning algorithms here - Today contains examples implemented:
- PageRank
- SSSP
- BiPartite Matching

Current focus
- Graph storage
- Message Mapping Graph Module ApacheCon 2013 Portland Barcamp
See the full transcript