### Present Remotely

Send the link below via email or IM

• Invited audience members will follow you as you navigate and present
• People invited to a presentation do not need a Prezi account
• This link expires 10 minutes after you close the presentation

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

You can change this under Settings & Account at any time.

# Scalable Graph Clustering with Pregel

Bryan Perozzi's presentation at CompleNet4, the 4th Workshop on Complex Networks
by

## Chris McCubbin

on 21 March 2013

Report abuse

#### Transcript of Scalable Graph Clustering with Pregel

BSP Graph Processing Scalable Graph Clustering with Pregel Algorithm Experiments
Future Work Test Setup Experimental Results and Scalability Thanks for Listening! Clustering Whole
Graphs MapReduce
Sweep Theoretical
Results Step 1:
Parallel Approximate
MapReduce
Pregel/Apache Giraph Local Graph
Algorithms Motivation and
Related Work We first need to compute approximate PageRank vectors in parallel. We use MapReduce to sweep the APR vectors generated in Giraph Have: Parallelizable method for finding local clusters in a distributed graph.
Want: Procedure to find a clustering of the graph SGI Cloudrack C2

34 Compute Trays with:
2x Intel Xeon E5649 CPUs (6 cores each)
12x 8GB DIMM memory
4x 3TB 7200 SATA drives
2x 600GB SSD drives
1x M2070 NVIDIA GPU
1x InfiniBandQDR, 2x 1Gb Ethernet

4x Intel Xeon E5649 CPUs (2 node boards each with 2 CPUs)
24x 8GB DIMM memory
8x 3TB SATA drives (RAID Controller)
4x 600GB SSD drives
2x InfiniBandQDR ,4x 1Gb Ethernet (2 nodes have additional 10GbE for ingest) Scales as expected, up to a point. Proof that PAPR converges to APR Complexity of PAPR: Limitation of Pregel model Pregel model can cut too many edges.
One solution: Vertex cuts (Gonzalez, et. al. 2012) Sweeping PPR to find clusters If we degree-normalize the nodes' PPR values, this imposes a order on the nodes n_i It can be shown that one of the sets {n_0...n_k} will contain a set of nearly-optimal conductance containing the start node. i.e. minimize this We can perform a superstep where one "push" operation is executed on all nodes at once.

Each superstep, vertices:
Process incoming messages (sum incoming PR)
"push" PR to neighbors

When all vertices have finished, the next superstep can begin. http://blog.octo.com/en/introduction-to-large-scale-graph-processing/ To implement PPR in practice, the push operation shown here will converge to the PPR of v_i when we initialize with
p_0 = 0
r_0 = the weight vector for v_i The output of Parallel APR is stored in HDFS. Sketch of MapReduce algorithm:
Mapper joins members of each PPR together, and sorts in descending order
Each Reducer sweeps it's vector and builds its cluster independently (embarrassingly parallel). (Checking the conductance of a cluster in reducer can be performed in a streaming manner - fast!) Chris McCubbin JT Halbert Spencer Beecher Bryan Perozzi How can we reason about community structure in big distributed graphs? graphs are everywhere. graphs typically stored in file systems or databases Personalized PageRank (PPR) Vectors Determine local importance vs global importance by weighting vertices (e.g. locally bias the computation) 1x10^9 User accounts
100x10^9 Images If PR is initialized with the uniform distribution: global PageRank.

If PR is initialized with all the weight on one vertex v_i: personalized PageRank for v_i. 14x10^9 internet-connected devices
7x10^9 Mobile Devices
2x10^9 people on the internet Approximate Personal Pagerank (APR) (Andersen, et. al 2006) 2x10^12 searches per year
5x10^6 searches per day Problem 1 How to select parameters? Community seed nodes: Hard to select but some guidance:
Degree
Global PageRank
Triangle Counting in vertex neighborhood
Community size:
No easy way out Solution 1: Randomly Sample Draw seed nodes based on the stationary distribution. (degree weighted probability) Draw community sizes from a distribution that focuses on finding smaller clusters. (Spielman and Teng’s RandomNibble 2003) Problem 2 Generating clusters in parallel means they can now overlap. (No help from the past: Speilman and Teng use a procedure which finds each cluster serially) How can we combine these overlapping local clusters? Solution 2 : No good solution Competing needs: Want to compute many potential clusters at once. But more clusters means more overlaps to resolve. Tried deconfliction procedure based on set overlap, but results hard to interpret. Settled on preserving conductance: clusters with lower conductance can break apart looser clusters.

Better option: Allow overlapping clusters? Local graph algorithms are output sensitive - running time proportional to the size of the output. Andersen et. al. (2006): 'PageRank-Nibble', a local graph algorithm for finding clusters using Personalized PageRank vectors. Idea: Find local clusters in parallel with Pregel and MapReduce (This is a natural fit: Pregel designed to accelerate PageRank computation.) Hadoop Ecosystem Distributed filesystem Key-Value/Graph Store APR in Pregel This algorithm will converge in time proportional to the size of the local cluster that minimizes conductance Take away:
If work evenly distributed,
adding more workers (w) reduces running time. Example of results: loc-Gowalla (V=200K,E=10^6) Louvain fast unfolding Parallel-Nibble vs. Big idea: Personalized PageRank is a good way to build local clusters The Pregel Model
Full transcript