
# Scalable Graph Clustering with Pregel

Bryan Perozzi's presentation at CompleNet4, the 4th Workshop on Complex Networks

by

## Chris McCubbin

on 21 March 2013

#### Transcript of Scalable Graph Clustering with Pregel

Outline:

- BSP Graph Processing
- Scalable Graph Clustering with Pregel
- Algorithm
- Experiments
- Future Work
- Test Setup
- Experimental Results and Scalability
- Thanks for Listening!
- Clustering Whole Graphs
- MapReduce Sweep
- Theoretical Results
- Step 1: Parallel Approximate PageRank
- Motivating Question
- Hadoop MapReduce
- Pregel/Apache Giraph
- Local Graph Algorithms
- Motivation and Related Work

We first need to compute approximate PageRank vectors in parallel; we then use MapReduce to sweep the APR vectors generated in Giraph.

Have: a parallelizable method for finding local clusters in a distributed graph.

Want: a procedure to find a clustering of the whole graph.

Test setup: SGI Cloudrack C2.

34 compute trays, each with:

- 2x Intel Xeon E5649 CPUs (6 cores each)
- 12x 8 GB DIMM memory
- 4x 3 TB 7200 RPM SATA drives
- 2x 600 GB SSD drives
- 1x NVIDIA M2070 GPU
- 1x InfiniBand QDR, 2x 1 Gb Ethernet

4 administration trays, each with:

- 4x Intel Xeon E5649 CPUs (2 node boards, each with 2 CPUs)
- 24x 8 GB DIMM memory
- 8x 3 TB SATA drives (RAID controller)
- 4x 600 GB SSD drives
- 2x InfiniBand QDR, 4x 1 Gb Ethernet (2 nodes have additional 10 GbE for ingest)

Experimental results: scales as expected, up to a point.

Theoretical results: a proof that PAPR converges to APR, and the complexity of PAPR.

Limitation of the Pregel model: the Pregel (edge-cut) model can cut too many edges.

One solution: vertex cuts (Gonzalez et al., 2012).

Sweeping PPR to find clusters: if we degree-normalize the nodes' PPR values, this imposes an order on the nodes n_i. It can be shown that one of the prefix sets {n_0, ..., n_k} is a set of nearly-optimal conductance containing the start node (i.e., it minimizes the conductance objective).

We can perform a superstep where one "push" operation is executed on all nodes at once.
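As an illustration of the sweep just described (a minimal serial sketch, not the authors' code): degree-normalize and sort the PPR vector, then scan the prefix sets and keep the one with the lowest conductance.

```python
# Illustrative sketch: sweep a personalized PageRank vector to find
# a low-conductance prefix set. `graph` is an adjacency dict; `ppr`
# maps node -> PPR mass. Not the authors' implementation.

def conductance(graph, cluster):
    """phi(S) = cut(S, V\\S) / min(vol(S), vol(V\\S))."""
    cluster = set(cluster)
    cut = sum(1 for u in cluster for v in graph[u] if v not in cluster)
    vol_s = sum(len(graph[u]) for u in cluster)
    vol_total = sum(len(graph[u]) for u in graph)
    denom = min(vol_s, vol_total - vol_s)
    return cut / denom if denom > 0 else 1.0

def sweep(graph, ppr):
    """Order nodes by degree-normalized PPR; return the prefix set
    {n_0, ..., n_k} with the smallest conductance, and that value."""
    order = sorted(ppr, key=lambda u: ppr[u] / len(graph[u]), reverse=True)
    best, best_phi = None, float("inf")
    prefix = []
    for u in order:
        prefix.append(u)
        phi = conductance(graph, prefix)
        if phi < best_phi:
            best, best_phi = list(prefix), phi
    return best, best_phi
```

For example, on two triangles joined by one edge with PPR mass concentrated on one triangle, the sweep recovers that triangle as the best prefix.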

Each superstep, vertices:

- process incoming messages (summing incoming PR), then
- "push" PR to their neighbors.

When all vertices have finished, the next superstep can begin. (Diagram from http://blog.octo.com/en/introduction-to-large-scale-graph-processing/)

To implement PPR in practice, the push operation shown here will converge to the PPR of v_i when we initialize with

p_0 = 0

r_0 = the weight vector for v_i

The output of parallel APR is stored in HDFS. Sketch of the MapReduce algorithm:

- The mapper joins the members of each PPR vector together and sorts them in descending order.
- Each reducer sweeps its vector and builds its cluster independently (embarrassingly parallel). Checking the conductance of a cluster in the reducer can be performed in a streaming manner, so it is fast.

Credits: Chris McCubbin, JT Halbert, Spencer Beecher, Bryan Perozzi

Motivating question: how can we reason about community structure in big distributed graphs? Graphs are everywhere (e.g., 1x10^9 user accounts), and they are typically stored in file systems or databases.

Personalized PageRank (PPR) vectors determine local importance versus global importance by weighting vertices (e.g., locally biasing the computation).
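The APR push iteration described earlier (initialized with p_0 = 0 and r_0 = the weight vector for v_i) can be sketched serially in Python. The `alpha` and `eps` parameters and the queue-driven loop follow one common formulation of Andersen et al.'s algorithm; they are our assumptions, not code from the talk.

```python
# Illustrative serial sketch of the approximate-PPR "push" iteration
# (one common formulation after Andersen et al. 2006; parameters and
# loop structure are our assumptions). The talk runs the same push on
# all vertices at once, once per Pregel superstep.

def approximate_ppr(graph, seed, alpha=0.15, eps=1e-4):
    """Return (p, r) with p approximating PPR(seed),
    starting from p_0 = 0 and r_0 = e_seed."""
    p = {u: 0.0 for u in graph}
    r = {u: 0.0 for u in graph}
    r[seed] = 1.0                       # r_0: all weight on the seed
    queue = [seed]
    while queue:
        u = queue.pop()
        if r[u] < eps * len(graph[u]):  # residual too small: skip
            continue
        # push: move alpha of u's residual into p, keep half of the
        # rest at u, and spread the other half over u's neighbors
        p[u] += alpha * r[u]
        share = (1 - alpha) * r[u] / (2 * len(graph[u]))
        r[u] = (1 - alpha) * r[u] / 2
        for w in graph[u]:
            r[w] += share
            if r[w] >= eps * len(graph[w]):
                queue.append(w)
        if r[u] >= eps * len(graph[u]):
            queue.append(u)
    return p, r
```

Each push conserves total mass (p plus r always sums to 1), and the loop stops once every residual is below eps times the vertex degree, which is what makes the running time output-sensitive.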

- 100x10^9 images
- 14x10^9 internet-connected devices
- 7x10^9 mobile devices
- 2x10^9 people on the internet
- 2x10^12 searches per year
- 5x10^6 searches per day

If PR is initialized with the uniform distribution, we get global PageRank. If PR is initialized with all the weight on one vertex v_i, we get the personalized PageRank for v_i.

Approximate Personalized PageRank (APR) (Andersen et al., 2006).

Problem 1: how do we select parameters? Community seed nodes are hard to select, but there is some guidance:

- Degree
- Global PageRank
- Triangle counting in the vertex neighborhood

Community size: no easy way out.

Solution 1: randomly sample. Draw seed nodes based on the stationary distribution (degree-weighted probability). Draw community sizes from a distribution that focuses on finding smaller clusters (Spielman and Teng's RandomNibble, 2003).

Problem 2: generating clusters in parallel means they can now overlap. (No help from the past: Spielman and Teng use a procedure that finds each cluster serially.) How can we combine these overlapping local clusters?

Solution 2: there is no good solution. Competing needs: we want to compute many potential clusters at once, but more clusters means more overlaps to resolve. We tried a deconfliction procedure based on set overlap, but the results were hard to interpret. We settled on preserving conductance: clusters with lower conductance can break apart looser clusters.
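A hypothetical sketch of the conductance-preserving deconfliction just described (the talk does not give the exact procedure, so the details here are assumptions): process clusters from lowest conductance to highest, letting tighter clusters keep contested nodes.

```python
# Hypothetical sketch (details are assumptions, not the talk's code)
# of conductance-preserving deconfliction: clusters with lower
# conductance keep their nodes; looser clusters lose the overlap.

def deconflict(clusters):
    """clusters: list of (conductance, node_set) pairs.
    Returns disjoint clusters, lowest conductance first."""
    claimed = set()
    result = []
    for phi, nodes in sorted(clusters, key=lambda c: c[0]):
        remaining = set(nodes) - claimed  # drop nodes a tighter cluster took
        if remaining:
            claimed |= remaining
            result.append((phi, remaining))
    return result
```

For example, a loose cluster {2, 3, 4} with conductance 0.5 overlapping a tight cluster {0, 1, 2} with conductance 0.1 is broken apart: the tight cluster keeps node 2.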

Better option: allow overlapping clusters?

Local graph algorithms are output-sensitive: their running time is proportional to the size of the output. Andersen et al. (2006) introduced 'PageRank-Nibble', a local graph algorithm for finding clusters using personalized PageRank vectors.

Idea: find local clusters in parallel with Pregel and MapReduce. (This is a natural fit: Pregel was designed to accelerate PageRank computation.)

Hadoop ecosystem: a distributed filesystem and a key-value/graph store.

APR in Pregel: this algorithm converges in time proportional to the size of the local cluster that minimizes conductance. Takeaway: if work is evenly distributed, adding more workers (w) reduces running time.

Example of results: loc-Gowalla (V = 200K, E = 10^6), Parallel-Nibble vs. Louvain fast unfolding.

Big idea: personalized PageRank is a good way to build local clusters.

The Pregel Model
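The Pregel/BSP model that runs through the talk (vertices process incoming messages, update their value, and send messages, with a barrier between supersteps) can be mimicked with a toy synchronous loop. This is purely illustrative; Apache Giraph's real API is Java and distributed, and all names here are ours.

```python
# Toy synchronous BSP loop in the spirit of Pregel (illustrative
# only; Apache Giraph's actual API differs). Each superstep, every
# vertex consumes its inbox, updates its value, and sends messages;
# message delivery is deferred to the next superstep (the barrier).

def bsp_run(graph, values, compute, max_supersteps=30):
    """graph: adjacency dict; values: initial vertex values."""
    values = dict(values)
    inbox = {v: [] for v in graph}
    for step in range(max_supersteps):
        outbox = {v: [] for v in graph}
        for v in graph:
            values[v], sends = compute(step, v, values[v], inbox[v], graph[v])
            for target, msg in sends:
                outbox[target].append(msg)
        if not any(outbox.values()):   # no messages in flight: halt
            break
        inbox = outbox                 # barrier between supersteps
    return values

def max_compute(step, v, value, msgs, neighbors):
    """Example vertex program: propagate the maximum initial value."""
    new_value = max([value] + msgs)
    if step == 0 or new_value > value:
        return new_value, [(w, new_value) for w in neighbors]
    return new_value, []
```

Running `max_compute` on a small path graph floods the largest value to every vertex in a few supersteps, the same pattern (message sum, local update, push to neighbors) the PPR computation uses.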
