# Herded Gibbs Sampling

by

## Mareija Eskelinen

on 21 February 2014


#### Transcript of Herded Gibbs Sampling

Luke Bornn, Yutian Chen, Nando de Freitas, Mareija Eskelin, Jing Fang, Max Welling

Herded Gibbs Sampling
Presentation by Mareija Eskelin

Herding
A deterministic procedure for generating pseudo-samples.

Gibbs Sampling

Herded Gibbs
We can build a deterministic Gibbs sampler for unnormalized distributions:
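In outline (our paraphrase of the construction, for binary x_i): maintain one herding weight per variable i and neighbourhood configuration x_{N(i)}, and replace each conditional draw by

```latex
x_i \leftarrow \mathbb{I}\!\left[ w_{i,\,x_{N(i)}} > 0 \right], \qquad
w_{i,\,x_{N(i)}} \leftarrow w_{i,\,x_{N(i)}} + p\!\left(x_i = 1 \mid x_{N(i)}\right) - x_i
```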
Analysis
ANALYZE a moving window of herded Gibbs samples that approximates the target distribution
(w denotes the herding weight vector)
Experiments
- A Simple Complete Graph (recall the example employing two variables)
- Named Entity Recognition
- Image Denoising
[Figure: performance of Gibbs and herded Gibbs on the binary 2-variable model, for parameter values 0.1, 0.01, 0.001, and 0.0001.]
For smaller values of this parameter it is harder to transition from state (0,0) to (1,1).
Modifying the pre-trained 3-class Stanford NER tagger, we evaluated the performance of herded Gibbs on the inference (labeling) task for NER.
NER is typically carried out with the Viterbi algorithm.
In order to compare against Viterbi, we chose to evaluate the performance of herded Gibbs on chain CRFs.
Performance is measured in per-entity F1 (a larger score is better). The average computation time is indicated in square brackets; the standard deviation of F1 for Gibbs is indicated in parentheses.
CRF for NER Performance
[Figure: the original image and its noisy, corrupted version.]
The results were averaged over 10 images corrupted with Gaussian noise N(0, 16); error bars correspond to one standard deviation, and D is the damping factor for mean-field inference.
Image Denoising Reconstruction Error
The model involves y_i, an observed pixel of the noisy image, and x_i, the corresponding unknown pixel of the original image (notation ours).
We make the assumption that nearby pixels should have similar values, and so we set the coupling parameters (for neighbouring pixels) to 1.
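The model the slide describes is a standard pairwise MRF for image denoising. The exact form below is our reconstruction (x_i are the latent pixels, y_i the observed pixels, and λ is a hypothetical likelihood weight); only the unit coupling between neighbouring pixels is taken from the slide:

```latex
p(x \mid y) \;\propto\; \exp\Big( \sum_{(i,j)\,\text{neighbours}} J_{ij}\, x_i x_j \;+\; \lambda \sum_i x_i y_i \Big), \qquad J_{ij} = 1
```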
Conclusions
O(1/T) convergence guarantee for models with independent variables and fully-connected graphical models
The above objective is minimized by generating samples according to the following recursion
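In standard herding notation (w_t the weight vector, φ the feature map, μ = E[φ(x)] the moment vector), the recursion is:

```latex
x_{t+1} = \arg\max_{x \in \mathcal{X}} \langle w_t, \phi(x) \rangle, \qquad
w_{t+1} = w_t + \mu - \phi(x_{t+1})
```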
Theorem
COMPARE to a Gibbs Markov chain initialized at the same state.
The Gibbs Markov chain has a known geometric convergence rate, i.e. the Gibbs distribution converges geometrically (this is not the same as the convergence rate of ergodic averages).
BOUND the discrepancy between the herded Gibbs estimate and the Gibbs chain by O(1/T).
Recall that the window starts at the time when we began collecting the T samples (whose empirical estimate converges at rate O(1/T)).

We show geometric convergence of herded Gibbs when FAR from the target.

Thus we obtain a bound on the distance to the target after a burn-in period of O(log T).
φ: feature vector
μ: moment vector
x: sample
Gibbs sampling alternately resamples each variable, either systematically or at random, given the configuration of its neighbours.
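As a concrete sketch (our illustration, not code from the talk), one systematic-scan sweep over binary variables might look like this, where `cond_prob` is a hypothetical callback returning the full conditional P(x_i = 1 | x_{-i}):

```python
import random

def gibbs_sweep(x, cond_prob, rng=random.random):
    """One systematic-scan Gibbs sweep over a vector of binary variables.

    x         -- current configuration, a list of 0/1 values (mutated in place)
    cond_prob -- cond_prob(i, x) returns P(x_i = 1 | x_{-i})
    rng       -- uniform(0, 1) generator, injectable for testing
    """
    for i in range(len(x)):
        p1 = cond_prob(i, x)            # full conditional for variable i
        x[i] = 1 if rng() < p1 else 0   # resample x_i given the rest
    return x
```

Repeating such sweeps and averaging the visited configurations gives the ergodic averages whose convergence rate is discussed next.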

Slower MCMC convergence rate of O(1/√T) for ergodic averages.

This is a consequence of the Central Limit Theorem.
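The CLT scaling behind this rate can be written as (our restatement, for a test function f and target π):

```latex
\Big(\mathbb{E}\Big[\big(\tfrac{1}{T}\textstyle\sum_{t=1}^{T} f(x_t) - \mathbb{E}_{\pi}[f]\big)^{2}\Big]\Big)^{1/2} = O\big(T^{-1/2}\big)
```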
However, herded Gibbs requires storing the weights for the conditionals
Conditional Random Field
Herded Gibbs could be applied to the more expressive skip-chain CRFs for further performance improvements
Herded Gibbs has O(1/T) convergence to the target distribution π, defined on a complete graph, with an O(log T) burn-in period, where T is the number of samples generated.
Here π̃ is the herded Gibbs empirical estimate of the joint, employing the T samples collected from the start of the window onward.
O(1/T) Convergence is Obtained when we:
Demonstrates promising empirical performance for select real-world problems
A novel deterministic variant of Gibbs
For an absolute convergence rate of herded Gibbs we evaluate the distance between the empirical estimate π̃ and the target π:
this distance is controlled because of the geometric convergence of the Gibbs chain.
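The ANALYZE/COMPARE/BOUND steps amount to a triangle-inequality decomposition (our paraphrase; π̃_T is the windowed herded Gibbs estimate, π^{(G)} the distribution of the comparison Gibbs chain, π the target):

```latex
\|\tilde{\pi}_T - \pi\| \;\le\; \underbrace{\|\tilde{\pi}_T - \pi^{(G)}\|}_{O(1/T)} \;+\; \underbrace{\|\pi^{(G)} - \pi\|}_{\text{geometric in the burn-in}}
```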
| Iterations | Annealed Gibbs | Herded Gibbs |
|---|---|---|
| 100 | 84.36 (0.16) [59.08 s] | 84.70 [55.73 s] |
| 400 | 84.51 (0.10) [83.49 s] | 84.75 [90.48 s] |
| 800 | 84.61 (0.05) [115.92 s] | 84.81 [132 s] |

Viterbi: 84.81 [46.74 s]
Herded Gibbs achieves a higher score than Annealed Gibbs in less time
Herded Gibbs achieves performance equal to that of Viterbi
We stress-tested the performance of herded Gibbs by manipulating the above joint:
as the parameter decreases, more iterations are required for convergence.
Note: this analysis relies on the reachability of all states, hence the restriction to fully-connected (complete) graphical models.
Herding transforms sampling into optimization (minimization). T, the number of samples, need not be known in advance.
Herding had only been applied to normalized distributions.
In contrast...
In the remainder of the talk, we will see that:
(i) For complete graphs, herded Gibbs retains the O(1/T) rate of herding.
(ii) It empirically performs better than Gibbs and mean field variants for image denoising and named entity recognition tasks.
Target distribution over the five states:

| State | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| Probability | 1/10 | 1/5 | 1/2 | 0 | 1/5 |
We are generating samples to MATCH the target
Let us consider herding on a model with 5 states
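A minimal sketch of herding on this 5-state target (our illustration; indicator features, weights initialised to the target probabilities):

```python
def herd(pi, T):
    """Deterministic herding over a finite state space with indicator features.

    pi -- target probability of each state
    T  -- number of pseudo-samples to generate
    Returns the visit count of each state.
    """
    w = list(pi)                # herding weights, initialised to the moments
    counts = [0] * len(pi)
    for _ in range(T):
        s = max(range(len(pi)), key=lambda i: w[i])  # pick the largest weight
        counts[s] += 1
        for i in range(len(pi)):                     # w <- w + pi - e_s
            w[i] += pi[i]
        w[s] -= 1.0
    return counts

counts = herd([1/10, 1/5, 1/2, 0, 1/5], 20)
```

After T pseudo-samples the visit frequencies match the target to within O(1/T); the zero-probability state S4 is never visited.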
Instead of sampling from each conditional, apply herding to each conditional.
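A minimal sketch of that idea for binary variables (our illustration with hypothetical names; one herding weight per variable and neighbour configuration, weights starting at 0):

```python
def herded_gibbs(cond, n_vars, n_sweeps):
    """Herded Gibbs for binary variables: each conditional draw is replaced
    by a deterministic herding update.

    cond(i, x) must return P(x_i = 1 | x_{-i}) and ignore the value of x[i].
    """
    x = [0] * n_vars
    w = {}                                       # weights, created lazily at 0
    samples = []
    for _ in range(n_sweeps):
        for i in range(n_vars):
            key = (i, tuple(x[:i] + x[i + 1:]))  # variable + neighbour config
            p = cond(i, x)                       # conditional probability
            wi = w.get(key, 0.0)
            x[i] = 1 if wi > 0 else 0            # deterministic herded choice
            w[key] = wi + p - x[i]               # herding weight update
        samples.append(tuple(x))
    return samples
```

For a single variable with P(x = 1) = 0.3, the deterministic sequence visits 1 exactly 30% of the time in the long run.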
[Log-plot: performance of herded Gibbs vs. Gibbs on the binary 2-variable model, for parameter values 0.1, 0.01, 0.001, and 0.0001.]
Convergence rate for ergodic averages is O(1/T), where T is the number of samples
[Figure: herding dynamics, showing the empirical estimate of the target and the herding weights w_i associated with the states S_i.]