Send the link below via email or IMCopy
Present to your audienceStart remote presentation
- Invited audience members will follow you as you navigate and present
- People invited to a presentation do not need a Prezi account
- This link expires 10 minutes after you close the presentation
- A maximum of 30 users can follow your presentation
- Learn more about this feature in our knowledge base article
Do you really want to delete this prezi?
Neither you, nor the coeditors you shared it with will be able to recover it again.
Make your likes visible on Facebook?
You can change this under Settings & Account at any time.
Herded Gibbs Sampling
Transcript of Herded Gibbs Sampling
Mareija Eskelin, Jing Fang, Max Welling
Herded Gibbs Sampling
A deterministic procedure for generating pseudo-samples
Presentation by Mareija Eskelin
We can build a deterministic Gibbs for unnormalized distributions:
ANALYZE a moving window of herded Gibbs' samples,
, that approximate the target distribution
- weight vector
Named Entity Recognition
A Simple Complete Graph
Recall the example employing 2 variables
Performance on the Binary
For smaller it is harder to transition from state (0,0) to (1,1)
Modifying the pre-trained 3 class Stanford NER we evaluated the performance of herded Gibbs on the inference (labeling) NER task
NER is typically carried out by the Viterbi algorithm
In order to compare to Viterbi we chose to evaluate the performance of herded Gibbs on chain-CRFs
Performance is measured in per-entity F1 ( a larger score is better). The average computation time is indicated in square brackets. The standard deviation in F1 for Gibbs is indicated in round braces
CRF for NER Performance
The results were averaged over 10 corrupted images with Gaussian noise N(0,16), error bars correspond to 1 standard deviation and D is the damping factor for Mean Field Inference.
Image Denoising Reconstruction Error
- an observed pixel of the noisy image
- an unknown pixel of the original image
Make the assumption that nearby pixel should have similar values and so we set the coupling parameters (for neighbouring pixels) to 1
O(1/T) convergence guarantee for models with independent variables and fully-connected graphical models
The above objective is minimized by generating samples according to the following recursion
COMPARE to a Gibbs Markov Chain, , initialized at
The Gibbs Markov Chain has a known geometric convergence, i.e. the Gibbs distribution converges geometrically (this is not the same as the convergence rate of ergodic averages)
BOUND the discrepancy between and by O(1/T)
recall was when we began collecting T samples (which had a convergence rate of O(1/T))
We show geometric convergence of herded Gibbs when FAR from the target
Thus we obtain a limit on for a burn-in period of
- feature vector
- moment vector
Alternates either systematically or randomly the sampling of each variable given the configuration of its neighbours
Slower MCMC convergence rate of
for ergodic averages.
This is a consequence of the Central Limit Theorem
However, herded Gibbs requires storing the weights for the conditionals
Conditional Random Field
Herded Gibbs could be applied to the more expressive skip-chain CRFs for further performance improvements
Herded Gibbs has O(1/T) convergence to the target distribution , defined on a complete graph, with an O(log(T)) burn-in period, where T is the number of samples generated.
is the herded Gibbs empirical estimate of the joint employing T samples, collected beginning at time
O(1/T) Convergence is Obtained when we:
Demonstrates promising empirical performance for select real-world problems
A novel deterministic variant of Gibbs
For an absolute convergence rate of herded Gibbs we evaluate the distance between and :
because of the geometric convergence of the Gibbs chain
What about ?
Herded Gibbs achieves a higher score than Annealed Gibbs in less time
Herded Gibbs achieves performance equal to that of Viterbi
We stressed and explored the performance of herded Gibbs by manipulating the above joint:
as decreases more iterations are required for convergence
Note: this analysis relies on the reachability of all states hence the restriction to fully complete graphical models
Herding transforms sampling to optimization (minimization)
T is unknown number
Herding had only been applied to normalized distributions.
In the remainder of the talk, we will see that:
(i) For complete graphs, herded Gibbs retains the O(1/T) rate of herding.
(ii) It empirically performs better than Gibbs and mean field variants for image denoising and named entity recognition tasks.
1/10 1/5 1/2 0 1/5
We are generating samples to MATCH the target
Let us consider herding on a model with 5 states
Instead of sampling from each conditional, apply herding to each conditional.
Log-Plot of Performance on the Binary 2-Variable Model
= 0.1 = 0.01 = 0.001 = 0.0001
Convergence rate for ergodic averages is O(1/T), where T is the number of samples
of the Target