**Herded Gibbs Sampling**

**Luke Bornn, Yutian Chen, Nando de Freitas, Mareija Eskelin, Jing Fang, Max Welling**

**Presentation by Mareija Eskelin**

Outline

- Herding: a deterministic procedure for generating pseudo-samples
- Gibbs Sampling

- Herded Gibbs: we can build a deterministic Gibbs for unnormalized distributions

- Analysis: analyze a moving window of herded Gibbs samples, $\hat{P}_T^{(\tau)}$, that approximates the target distribution $\pi$

- Experiments: Named Entity Recognition, Image Denoising
- Conclusions

A Simple Complete Graph

Recall the example employing 2 binary variables.

We stressed and explored the performance of herded Gibbs by manipulating this joint, using parameter values 0.1, 0.01, 0.001, and 0.0001.

[Figure: performance on the binary 2-variable model, comparing Gibbs and herded Gibbs at each parameter value.]

For smaller parameter values it is harder to transition from state (0,0) to (1,1), so as the parameter decreases, more iterations are required for convergence.

Note: this analysis relies on the reachability of all states, hence the restriction to fully-connected graphical models.
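The exact joint from the slide is not recoverable here, so as a hypothetical stand-in consistent with the observation above, consider a symmetric joint whose off-diagonal mass is the manipulated parameter. The following minimal Python sketch (the form of `make_joint` is an assumption for illustration, not taken from the paper) shows that the probability of leaving state (0,0) in a single Gibbs update of $x_1$ then equals the parameter:

```python
import numpy as np

def make_joint(eps):
    """Hypothetical symmetric joint over two binary variables, with
    off-diagonal mass eps (an assumed form, for illustration only)."""
    return np.array([[(1 - eps) / 2, eps / 2],
                     [eps / 2, (1 - eps) / 2]])

for eps in [0.1, 0.01, 0.001, 0.0001]:
    P = make_joint(eps)
    p_leave = P[1, 0] / P[:, 0].sum()  # P(x1 = 1 | x2 = 0)
    print(f"eps={eps}: P(leave (0,0) when resampling x1) = {p_leave:.4g}")
```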

Named Entity Recognition

Modifying the pre-trained 3-class Stanford NER system, we evaluated the performance of herded Gibbs on the NER inference (labeling) task.

NER is typically carried out by the Viterbi algorithm, so in order to compare to Viterbi we evaluate the performance of herded Gibbs on chain-CRFs.


Image Denoising

- $y_i$: an observed pixel of the noisy image
- $x_i$: an unknown pixel of the original image

We make the assumption that nearby pixels should have similar values, and so we set the coupling parameters (for neighbouring pixels) to 1.

[Figure: original image and noisy image.]

[Figure: image denoising reconstruction error. Results were averaged over 10 corrupted images with Gaussian noise N(0, 16); error bars correspond to 1 standard deviation, and D is the damping factor for mean-field inference.]
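To make the model concrete, here is a minimal sketch of one pixel's Gibbs conditional under the assumptions just stated. It takes binary +1/-1 pixels, coupling 1 between 4-neighbours, and reads N(0, 16) as a noise variance of 16; this model form is an illustrative assumption, not the paper's exact experimental specification.

```python
import numpy as np

SIGMA2 = 16.0  # Gaussian noise N(0, 16), read here as variance 16
J = 1.0        # coupling parameter for neighbouring pixels

def p_plus(x, y, i, j):
    """P(x_ij = +1 | neighbours, y_ij) for binary (+1/-1) pixels with
    pairwise coupling J and likelihood y_ij ~ N(x_ij, SIGMA2)."""
    h, w = x.shape
    field = sum(J * x[a, b]
                for a, b in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                if 0 <= a < h and 0 <= b < w)
    # log-likelihood ratio: log N(y; +1, SIGMA2) - log N(y; -1, SIGMA2)
    loglik_ratio = 2.0 * y[i, j] / SIGMA2
    return 1.0 / (1.0 + np.exp(-2.0 * field - loglik_ratio))

# Toy demo: a few Gibbs sweeps over a small corrupted "image".
rng = np.random.default_rng(0)
truth = np.ones((8, 8))
y = truth + rng.normal(0.0, np.sqrt(SIGMA2), truth.shape)
x = np.where(y > 0, 1.0, -1.0)
for _ in range(10):
    for i in range(8):
        for j in range(8):
            x[i, j] = 1.0 if rng.random() < p_plus(x, y, i, j) else -1.0
print("reconstruction error:", np.mean(x != truth))
```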

Herding Dynamics

- $w$: weight vector
- $\phi(x)$: feature vector
- $\mu$: moment vector
- $x$: sample

Herding matches the empirical moments of the generated pseudo-samples to the target moments $\mu$. This objective is minimized by generating samples according to the following recursion:

$$x_t = \arg\max_{x} \langle w_{t-1}, \phi(x) \rangle, \qquad w_t = w_{t-1} + \mu - \phi(x_t)$$
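As an illustration, here is a minimal Python sketch of this recursion for a discrete target with one-hot features, so the moment vector $\mu$ is the target probability vector itself. The demo target is the 5-state distribution used later in this talk; the code is a sketch of the general idea, not the paper's reference implementation.

```python
import numpy as np

def herding(mu, T):
    """Herding with one-hot features phi(x) = e_x, so the moment
    vector mu equals the target probability vector."""
    w = mu.copy()  # weight vector; w_0 = mu is a common initialization
    samples = []
    for _ in range(T):
        x = int(np.argmax(w))  # x_t = argmax_x <w_{t-1}, phi(x)>
        w = w + mu             # w_t = w_{t-1} + mu - phi(x_t)
        w[x] -= 1.0
        samples.append(x)
    return samples

# 5-state target: (1/10, 1/5, 1/2, 0, 1/5)
mu = np.array([0.1, 0.2, 0.5, 0.0, 0.2])
samples = herding(mu, 2000)
print(np.bincount(samples, minlength=5) / len(samples))
# The empirical frequencies match the target at rate O(1/T).
```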

Analysis

Theorem: Herded Gibbs has O(1/T) convergence to the target distribution $\pi$, defined on a complete graph, with an O(log T) burn-in period, where T is the number of samples generated. Here $\hat{P}_T^{(\tau)}$ is the herded Gibbs empirical estimate of the joint employing T samples, collected beginning at time $\tau$.

O(1/T) convergence is obtained when we:

- COMPARE to a Gibbs Markov chain, $P_t$, initialized at the same starting point. The Gibbs Markov chain has a known geometric convergence, i.e. the Gibbs distribution converges geometrically to $\pi$ (this is not the same as the convergence rate of ergodic averages).
- BOUND the discrepancy between $\hat{P}_T^{(\tau)}$ and $P_t$ by O(1/T); recall that $\tau$ is when we began collecting the T samples.
- Show geometric convergence of herded Gibbs when FAR from the target; thus we obtain a limit on $\tau$, a burn-in period of O(log T).

For an absolute convergence rate of herded Gibbs we evaluate the distance between $\hat{P}_T^{(\tau)}$ and $\pi$:

$$\|\hat{P}_T^{(\tau)} - \pi\| \le \|\hat{P}_T^{(\tau)} - P_t\| + \|P_t - \pi\|$$

The second term vanishes because of the geometric convergence of the Gibbs chain. What about the first term? It is bounded by O(1/T), as above.



Gibbs Sampling

Gibbs sampling alternates, either systematically or randomly, the sampling of each variable given the configuration of its neighbours.

It has a slower MCMC convergence rate of $O(1/\sqrt{T})$ for ergodic averages. This is a consequence of the Central Limit Theorem.
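For concreteness, here is a minimal sketch of a systematic-scan Gibbs sampler for two binary variables. The 2x2 joint table is illustrative (it is the hypothetical symmetric family from the earlier sketch with parameter 0.1), not a model taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint P[x1, x2] over two binary variables.
P = np.array([[0.45, 0.05],
              [0.05, 0.45]])

def gibbs(P, T, x1=0, x2=0):
    """Systematic scan: resample each variable from its full
    conditional given the current value of the other."""
    samples = []
    for _ in range(T):
        x1 = rng.choice(2, p=P[:, x2] / P[:, x2].sum())  # x1 ~ P(x1 | x2)
        x2 = rng.choice(2, p=P[x1, :] / P[x1, :].sum())  # x2 ~ P(x2 | x1)
        samples.append((x1, x2))
    return samples

counts = np.zeros((2, 2))
for s in gibbs(P, 20000):
    counts[s] += 1
print(counts / counts.sum())  # ergodic averages converge at O(1/sqrt(T))
```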

However, unlike standard Gibbs, herded Gibbs requires storing the weights for the conditionals.




CRF for NER Performance

Performance is measured in per-entity F1 (a larger score is better). The average computation time is indicated in square brackets. The standard deviation in F1 for Gibbs is indicated in round brackets.

| Iterations | Annealed Gibbs | Herded Gibbs |
|---|---|---|
| 100 | 84.36 (0.16) [59.08 s] | 84.70 [55.73 s] |
| 400 | 84.51 (0.10) [83.49 s] | 84.75 [90.48 s] |
| 800 | 84.61 (0.05) [115.92 s] | 84.81 [132 s] |

Viterbi: 84.81 [46.74 s]

- Herded Gibbs achieves a higher score than Annealed Gibbs in less time.
- Herded Gibbs achieves performance equal to that of Viterbi.



Herding

Herding transforms sampling into optimization (minimization), where T, the number of samples, is unknown in advance.

Herding had only been applied to normalized distributions.

In contrast...

In the remainder of the talk, we will see that:

(i) For complete graphs, herded Gibbs retains the O(1/T) rate of herding.

(ii) It empirically performs better than Gibbs and mean field variants for image denoising and named entity recognition tasks.

Let us consider herding on a model with 5 states.

Target Distribution

| State | $S_1$ | $S_2$ | $S_3$ | $S_4$ | $S_5$ |
|---|---|---|---|---|---|
| Probability | 1/10 | 1/5 | 1/2 | 0 | 1/5 |

We are generating samples to MATCH the target.

Herded Gibbs

Instead of sampling from each conditional, apply herding to each conditional.
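A minimal sketch of that idea, mirroring the Gibbs sketch above but replacing each random draw with a deterministic herding update, with one weight per conditional (per variable and conditioning value). The zero initialization and thresholding rule here are illustrative simplifications rather than the paper's exact algorithm; the weight tables `w1` and `w2` are precisely the per-conditional storage cost mentioned earlier.

```python
import numpy as np

# Illustrative 2x2 joint, as in the Gibbs sketch.
P = np.array([[0.45, 0.05],
              [0.05, 0.45]])

def herd_step(w, p):
    """One herding update for a Bernoulli conditional with success
    probability p: emit 1 while the weight is positive, then move
    the weight by p minus the emitted value."""
    x = int(w > 0)
    return x, w + p - x

def herded_gibbs(P, T, x1=0, x2=0):
    """Deterministic Gibbs scan with one herding weight per
    (variable, conditioning value) pair."""
    w1 = np.zeros(2)  # w1[x2] herds P(x1 = 1 | x2)
    w2 = np.zeros(2)  # w2[x1] herds P(x2 = 1 | x1)
    samples = []
    for _ in range(T):
        x1, w1[x2] = herd_step(w1[x2], P[1, x2] / P[:, x2].sum())
        x2, w2[x1] = herd_step(w2[x1], P[x1, 1] / P[x1, :].sum())
        samples.append((x1, x2))
    return samples

counts = np.zeros((2, 2))
for s in herded_gibbs(P, 20000):
    counts[s] += 1
print(counts / counts.sum())  # empirical joint approaches P
```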

[Figure: log-plot of performance on the binary 2-variable model for parameter values 0.1, 0.01, 0.001, and 0.0001, comparing herded Gibbs and Gibbs.]

Herding's convergence rate for ergodic averages is O(1/T), where T is the number of samples; herded Gibbs retains this rate on complete graphs.

[Figure: herding dynamics, showing the evolution of the herding weights $w_i$ for each state $S_i$ and the empirical estimate of the target.]

**Conclusions**

- A novel deterministic variant of Gibbs sampling.
- O(1/T) convergence guarantee for models with independent variables and for fully-connected graphical models.
- Demonstrates promising empirical performance on select real-world problems.
- Herded Gibbs could be applied to the more expressive skip-chain CRFs for further performance improvements.