Prezi is an interactive zooming presentation

### Present Remotely

Send the link below via email or IM

• Invited audience members will follow you as you navigate and present
• People invited to a presentation do not need a Prezi account
• This link expires 10 minutes after you close the presentation

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

You can change this under Settings & Account at any time.

# Distributed SVM

No description
by

## Martina Zambelli

on 11 March 2013

Report abuse

#### Transcript of Distributed SVM

Martina Fischetti, Alberto Padoan, Martina Zambelli Consensus-Based
Distributed Optimization
for Machine Learning Main topics Newton-Raphson Method Conclusions Computational results ADMM distributed algorithm for the exact computation of the optimal solution, approximatively operating as a Newton-Raphson minimization procedure We have implemented and analyzed two alternative approaches for distributed classification in the context of Support Vector Machines
Alternating Direction Method of Multipliers
Newton-Raphson method

The two methods have been implemented in MATLAB and computationally compared on a test case. Martina Fischetti, Alberto Padoan, Martina Zambelli Consensus-Based Distributed Optimization
for Machine Learning distributed classification
in the context of
Support Vector Machines
(SVM) SVM
for classification supervised learning method that assumes the availability of a set of training points to be used to define the main parameters of the final classification rule Algorithms:
NR Algorithms The original approach is centralized, in the sense that it requires the explicit knowledge of all training points BUT in some important cases the training points involve sensitive information that cannot be distributed DISTRIBUTED SVM Alternating Direction Method of Multipliers (ADMM)
Newton-Raphson (NR) method SVM for classification 1. Optimal Hyperplan create an optimal hyperplane to separate two classes of points by maximizing its distance from the closest point from either classes a training set consisting of N pairs the aim: determine a hyperplane in the feature space that separates the training points according to their classification 2. Convex Optimization looking for the “most robust one”, i.e., for the one that maximizes the separation margin only some points are in fact used
in the optimization SUPPORT VECTOR 3. Classification a new point x not belonging to the training set can be classified by defining its class y(x) through the simple formula: NONSEPARABLE CASE 2 approaches:
allow misclassifications
(LOSS-FUNCTION)
lift to a higher dimensional space (KERNEL TRICK) LOSS FUNCTION quantifies the “classification error” and solves a modified optimization problem of the form where C is a parameter that gives a tradeoff between margin and misclassification. define a convex optimization problem that can be solved efficiently KERNEL TRICK enlarging the feature space in the hope that the training points become linearly separable in the new space. curse of dimensionality instead of working explicitly on the extended feature space, define a symmetric positive semi-definite kernel function Main problem The Dual Problem Distributed version Node/edge failures Restatements and Augmented Lagrangian reduction For each given ū we have: 3 STEPS: Step 1

Step 2

Step 3 Framework (sparse) undirected and connected graph G = (V,E) with NC := |V| nodes, each associated with a certain computational resource that we call processor.
each edge (i; j) corresponds to a bidirected connection between nodes i and j, and allows the two associated processors to share data.
assume that the N input points of the training set are split among the processor, in the sense that a partition is given, and each node only has access to the training points of its cluster. Goal solve the convex optimization rewritten as Steps: Step 1:

Step 2:

Step 3: Step 2 can be solved in closed form:
by imposing grad(.) = 0 we get the stationary condition: each α is just the average β of its neighborhoods,
minus a correcting term weighed by 1/ρ case where some nodes become temporary unavailable during the computation (the case of edge failures is analogous). case that the probability of having all nodes active at the same time (i.e., at some ADMM iteration) is very low. Naive solution just skipping the “unavailable terms” in the Lagrangian function

such a policy turned out to be inapplicable as the overall convergence is heavily affected even in case of a very low node failure probability Proposed solution keep a “last seen” backup copy of the α(i)’s and β(i)’s of its neighborhood nodes, and of the last u(i, j)’s of the arcs incident in h.
whenever a node is active, it sends its vectors to its active neighbors, so they can update their backup copies.
use backup copies in case of node failures
after the first iterations all copies tend to stabilize, so the impact of using a backup copy instead of the “true” one becomes less and less relevant, and the overall convergence properties are preserved Newton-Raphson Consensus (NRC) Distributed
Unconstrained Minimization cost function
global function to be minimized
the minimizer Consensus Algorithm
and Asynchronous NRC each agent defines the local variables

two average consensus algorithms: P is a consensus matrix local estimate at time k that each agent
has about the global minimizer Remark: the previous algorithm can be modified, in order to work in the general case where the local cost functions are strictly convex, but not necessarily quadratic Robustness
and Convergence under some simplifying assumptions, the initial conditions on the initial estimates can be arbitrary, however initial conditions on other parameters can change the final convergence point and might even lead to instability for sufficiently large values
convergence to the optimal solution for appropriate choices of the algorithm parameters (principle of separation of time-scales)
properties of consensus algorithms:
- simplicity
- potential implementation with asynchronous
communication schemes
- ability to adapt to time-varying network
topologies ADMM performance NRC performance Comparisons We have also proposed a fault-tolerant variant.
The approach is also suited for an
asynchronous implementation. Preliminary debugging phase Tests Node failures Preliminary debugging phase Comparison with the centralized SVM:
comparison of the results computed by the distributed ADMM and by the centralized case (NC = 1, ρ = 0, and aborting the ADMM right after the first execution of Step 1)

Check that the final hyperplane
we verified that the final hyperplane in the feature space correctly separates all training point in the linearly-separable case (very large C parameter) Tests generation of random instances with different graph size and density, number of training points, and misclassification probability

logarithmic loss function ADMM behavior over iter.s

ADMM behavior over time Node Failures at each ADMM iteration k, for each node v we generated a node-failure flag fail(v) which is true with a given probability
(with same items!!) NAIVE SOLUTION

ROBUST IMPLEMENTATION Scalar case: Multidimensional case cost function
the minimizer Remark: replace derivatives with
Full Hessian
Jacobi approximation
Consensus Matrix created from the adjacency matrix A

Laplacian method: for each edge (i,j) method getProtocol(): returns a collection of matrices implementing the symmetric gossip protocol for a given adjacency matrix.
(Metropolis-Hastings) ASYNCHRONOUS NRC Node/edge failures Naive solution Proposed solution naive modification of the implemented asynchronous version of NRC. For each failed node h, we just keep the self-loop related to the node h in the consensus matrix P
the average consensus property of P is not preserved
convergence to consensus is not guaranteed, as P is no more an average consensus matrix simulate the nodes failures, and randomly choose among a list of matrices associated to edges linking two active nodes
if either there are no active nodes in the network or there is just one, we use the identity matrix for the consensus step
the asynchronous NRC takes into account only the available links, which is a appreciable feature in when dealing with time varying networks Preliminary debugging phase Tests Node failures Preliminary debugging phase Comparison with the centralized SVM:
comparison of the results (NC = 1)

Check that the final hyperplane
verified that in the linearly separable
case and the final hyperplane in the feature space Tests generation of random instances with different graph size and density, number of training points, and misclassification probability

logarithmic loss function

performance of NRC algorithm by choosing three different hessians: Typical behaviour Asynchronous NRC behaviour Node/edge failures Naive implementation Proposed solution node failure probability 20% node failure probability 80% node failure probability 20% node failure probability 20% Graphs' topologies Tests on graphs with different topologies function called getTopology(NC,topology) returning an adjacency matrix representing a graph with the assigned number of nodes NC and a topology that can be chosen among:
- star
- ring
- (fully connected) mesh
- line NC = 5 nodes
7 undirected edges
(plus the self-loops)
dimension of the feature
space f = 5
the dataset has 100 points
spread into the different clusters Synchronous case Asynchronous case p_mis = 0,2 p_mis = 0,2 p_mis = 0,8 Future directions design of auto-tuning algorithms for parameter in the asynchronous NRC algorithm extending our approach by addressing known SVM kernels that can be explicitly expressed in an (enlarged but manageable) extended feature space, thus preserving the privacy of the training points
Thank you for the attention!
Full transcript