Martina Fischetti, Alberto Padoan, Martina Zambelli Consensus-Based

Distributed Optimization

for Machine Learning Main topics Newton-Raphson Method Conclusions Computational results ADMM distributed algorithm for the exact computation of the optimal solution, approximatively operating as a Newton-Raphson minimization procedure We have implemented and analyzed two alternative approaches for distributed classification in the context of Support Vector Machines

Alternating Direction Method of Multipliers

Newton-Raphson method

The two methods have been implemented in MATLAB and computationally compared on a test case. Martina Fischetti, Alberto Padoan, Martina Zambelli Consensus-Based Distributed Optimization

for Machine Learning distributed classification

in the context of

Support Vector Machines

(SVM) SVM

for classification supervised learning method that assumes the availability of a set of training points to be used to define the main parameters of the final classification rule Algorithms:

ADMM

NR Algorithms The original approach is centralized, in the sense that it requires the explicit knowledge of all training points BUT in some important cases the training points involve sensitive information that cannot be distributed DISTRIBUTED SVM Alternating Direction Method of Multipliers (ADMM)

Newton-Raphson (NR) method SVM for classification 1. Optimal Hyperplan create an optimal hyperplane to separate two classes of points by maximizing its distance from the closest point from either classes a training set consisting of N pairs the aim: determine a hyperplane in the feature space that separates the training points according to their classification 2. Convex Optimization looking for the “most robust one”, i.e., for the one that maximizes the separation margin only some points are in fact used

in the optimization SUPPORT VECTOR 3. Classification a new point x not belonging to the training set can be classified by defining its class y(x) through the simple formula: NONSEPARABLE CASE 2 approaches:

allow misclassifications

(LOSS-FUNCTION)

lift to a higher dimensional space (KERNEL TRICK) LOSS FUNCTION quantifies the “classification error” and solves a modified optimization problem of the form where C is a parameter that gives a tradeoff between margin and misclassification. define a convex optimization problem that can be solved efficiently KERNEL TRICK enlarging the feature space in the hope that the training points become linearly separable in the new space. curse of dimensionality instead of working explicitly on the extended feature space, define a symmetric positive semi-definite kernel function Main problem The Dual Problem Distributed version Node/edge failures Restatements and Augmented Lagrangian reduction For each given ū we have: 3 STEPS: Step 1

Step 2

Step 3 Framework (sparse) undirected and connected graph G = (V,E) with NC := |V| nodes, each associated with a certain computational resource that we call processor.

each edge (i; j) corresponds to a bidirected connection between nodes i and j, and allows the two associated processors to share data.

assume that the N input points of the training set are split among the processor, in the sense that a partition is given, and each node only has access to the training points of its cluster. Goal solve the convex optimization rewritten as Steps: Step 1:

Step 2:

Step 3: Step 2 can be solved in closed form:

by imposing grad(.) = 0 we get the stationary condition: each α is just the average β of its neighborhoods,

minus a correcting term weighed by 1/ρ case where some nodes become temporary unavailable during the computation (the case of edge failures is analogous). case that the probability of having all nodes active at the same time (i.e., at some ADMM iteration) is very low. Naive solution just skipping the “unavailable terms” in the Lagrangian function

such a policy turned out to be inapplicable as the overall convergence is heavily affected even in case of a very low node failure probability Proposed solution keep a “last seen” backup copy of the α(i)’s and β(i)’s of its neighborhood nodes, and of the last u(i, j)’s of the arcs incident in h.

whenever a node is active, it sends its vectors to its active neighbors, so they can update their backup copies.

use backup copies in case of node failures

after the first iterations all copies tend to stabilize, so the impact of using a backup copy instead of the “true” one becomes less and less relevant, and the overall convergence properties are preserved Newton-Raphson Consensus (NRC) Distributed

Unconstrained Minimization cost function

global function to be minimized

the minimizer Consensus Algorithm

and Asynchronous NRC each agent defines the local variables

and updates them based on

two average consensus algorithms: P is a consensus matrix local estimate at time k that each agent

has about the global minimizer Remark: the previous algorithm can be modified, in order to work in the general case where the local cost functions are strictly convex, but not necessarily quadratic Robustness

and Convergence under some simplifying assumptions, the initial conditions on the initial estimates can be arbitrary, however initial conditions on other parameters can change the final convergence point and might even lead to instability for sufficiently large values

convergence to the optimal solution for appropriate choices of the algorithm parameters (principle of separation of time-scales)

properties of consensus algorithms:

- simplicity

- potential implementation with asynchronous

communication schemes

- ability to adapt to time-varying network

topologies ADMM performance NRC performance Comparisons We have also proposed a fault-tolerant variant.

The approach is also suited for an

asynchronous implementation. Preliminary debugging phase Tests Node failures Preliminary debugging phase Comparison with the centralized SVM:

comparison of the results computed by the distributed ADMM and by the centralized case (NC = 1, ρ = 0, and aborting the ADMM right after the first execution of Step 1)

Check that the final hyperplane

we verified that the final hyperplane in the feature space correctly separates all training point in the linearly-separable case (very large C parameter) Tests generation of random instances with different graph size and density, number of training points, and misclassification probability

logarithmic loss function ADMM behavior over iter.s

ADMM behavior over time Node Failures at each ADMM iteration k, for each node v we generated a node-failure flag fail(v) which is true with a given probability

(with same items!!) NAIVE SOLUTION

ROBUST IMPLEMENTATION Scalar case: Multidimensional case cost function

the minimizer Remark: replace derivatives with

gradient and hessian matrix

Full Hessian

Jacobi approximation

Gradient Descent

Consensus Matrix created from the adjacency matrix A

Laplacian method: for each edge (i,j) method getProtocol(): returns a collection of matrices implementing the symmetric gossip protocol for a given adjacency matrix.

(Metropolis-Hastings) ASYNCHRONOUS NRC Node/edge failures Naive solution Proposed solution naive modification of the implemented asynchronous version of NRC. For each failed node h, we just keep the self-loop related to the node h in the consensus matrix P

the average consensus property of P is not preserved

convergence to consensus is not guaranteed, as P is no more an average consensus matrix simulate the nodes failures, and randomly choose among a list of matrices associated to edges linking two active nodes

if either there are no active nodes in the network or there is just one, we use the identity matrix for the consensus step

the asynchronous NRC takes into account only the available links, which is a appreciable feature in when dealing with time varying networks Preliminary debugging phase Tests Node failures Preliminary debugging phase Comparison with the centralized SVM:

comparison of the results (NC = 1)

Check that the final hyperplane

verified that in the linearly separable

case and the final hyperplane in the feature space Tests generation of random instances with different graph size and density, number of training points, and misclassification probability

logarithmic loss function

performance of NRC algorithm by choosing three different hessians: Typical behaviour Asynchronous NRC behaviour Node/edge failures Naive implementation Proposed solution node failure probability 20% node failure probability 80% node failure probability 20% node failure probability 20% Graphs' topologies Tests on graphs with different topologies function called getTopology(NC,topology) returning an adjacency matrix representing a graph with the assigned number of nodes NC and a topology that can be chosen among:

- star

- ring

- (fully connected) mesh

- line NC = 5 nodes

7 undirected edges

(plus the self-loops)

dimension of the feature

space f = 5

the dataset has 100 points

spread into the different clusters Synchronous case Asynchronous case p_mis = 0,2 p_mis = 0,2 p_mis = 0,8 Future directions design of auto-tuning algorithms for parameter in the asynchronous NRC algorithm extending our approach by addressing known SVM kernels that can be explicitly expressed in an (enlarged but manageable) extended feature space, thus preserving the privacy of the training points

Thank you for the attention!

### Present Remotely

Send the link below via email or IM

CopyPresent to your audience

Start remote presentation- Invited audience members
**will follow you**as you navigate and present - People invited to a presentation
**do not need a Prezi account** - This link expires
**10 minutes**after you close the presentation - A maximum of
**30 users**can follow your presentation - Learn more about this feature in our knowledge base article

# Distributed SVM

No description

by

Tweet