**Before ...**

EM is not enough ...

Goal: approximate the full distribution

**But why?**

**Statistical Framework**

**What the ...**

**Bayesian Approach**

**Come back to LDA**

**Hierarchical Process**

**Conjugate priors**

**Exponential families**

**Latent variables**

Collapsed Gibbs Sampling

**MCMC**

**variational approximation**

That's the purpose of LDA

Supervised Topic Models

Supervised LDA

Motivation?

What is the probability of success, given 3 heads in 10 throws?

(Maximum Likelihood assumed)

H T T T H T H T T T
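
A worked version of the maximum-likelihood answer for this data (h = 3 heads in n = 10 throws):

$$
L(p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} = p^{h}(1-p)^{n-h},
\qquad
\hat{p}_{\mathrm{ML}} = \arg\max_p L(p) = \frac{h}{n} = \frac{3}{10} = 0.3
$$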

Maximum Likelihood

Least Squares and ML?

easy

interpretable

asymptotic properties

invariant under reparametrization

point estimate

reparametrization

**Latent Variables**

Variational

Methods

Markov Chain Monte Carlo

Gibbs Sampling

Variational Methods

**Bayesian Approach**

A Bayesian approach to a problem starts with the formulation of a model that we hope is adequate to describe the situation of interest. We then formulate a prior distribution over the unknown parameters of the model, which is meant to capture our beliefs about the situation before seeing the data. After observing some data, we apply Bayes' Rule to obtain a posterior distribution for these unknowns, which takes account of both the prior and the data. From this posterior distribution we can compute predictive distributions for future observations.
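
In symbols (notation added here for concreteness), with data $D$ and parameters $\theta$:

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(\tilde{x} \mid D) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid D)\, d\theta
$$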

But before that ...

Maximum A Posteriori

A Posteriori Distribution

BTW, ML is a degenerate case of MAP (uniform prior)

Predictive Distribution

Beta Distribution

Conjugate Priors !
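
A sketch of why the Beta prior is convenient here: it is conjugate to the Bernoulli likelihood, so with $h$ heads in $n$ throws the posterior is again a Beta:

$$
\mathrm{Beta}(p \mid \alpha, \beta) \times p^{h}(1-p)^{n-h}
\;\propto\;
\mathrm{Beta}(p \mid \alpha + h,\, \beta + n - h)
$$

$$
\hat{p}_{\mathrm{MAP}} = \frac{\alpha + h - 1}{\alpha + \beta + n - 2},
\qquad
p(\text{heads} \mid D) = \frac{\alpha + h}{\alpha + \beta + n}
$$

With the uniform prior $\alpha = \beta = 1$ and the 3-heads-in-10-throws data, the MAP estimate is $3/10$ (the ML estimate), while the predictive probability of heads is $4/12 \approx 0.33$.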

```python
import pymc as mc
from gen_data_bern import getData

throws = getData()

# uniform prior over the success probability
p = mc.Uniform('p', lower=0, upper=1)
# Bernoulli likelihood of the observed coin throws
obs = mc.Bernoulli("obs", p, value=throws, observed=True)
```

```python
import pymc as mc
from gen_data_bern import getData

throws = getData()

# hyperpriors on the Beta parameters
alpha = mc.Exponential('alpha', beta=0.5)
beta = mc.Normal('beta', mu=5, tau=1)
# Beta prior over the success probability
p = mc.Beta('p', alpha=alpha, beta=beta)
# Bernoulli likelihood of the observed coin throws
obs = mc.Bernoulli("obs", p, value=throws, observed=True)
```

```python
import pymc as mc
import numpy as np
import model_binomial

model = model_binomial
iterations = 25000

# run MCMC, discarding the first 5000 samples and keeping every 2nd one
model = mc.MCMC(model, db='pickle')
model.sample(iter=iterations, burn=5000, thin=2)
model.stats()

# saving graph
mc.graph.graph(model)

import matplotlib.pyplot as plt
mc.Matplot.plot(model)
plt.show()
```

```
{'p':     {'95% HPD interval': array([ 0.5933474 ,  0.64185615]),
           'n': 122500,
           'quantiles': {2.5: 0.59344379128091884, 25: 0.60948935734400123,
                         50: 0.61793721629785492, 75: 0.62633076257775722,
                         97.5: 0.64201057731255884},
           'standard deviation': 0.012458537921094534,
           'mc error': 6.7276335499864257e-05,
           'mean': 0.61787796180912014},
 'beta':  {'95% HPD interval': array([ 2.73917959,  6.60336035]),
           'n': 122500,
           'quantiles': {2.5: 2.7878098015030157, 25: 4.0646363496998505,
                         50: 4.7246670506308996, 75: 5.3858697946343597,
                         97.5: 6.6563463874649784},
           'standard deviation': 0.98090170439034674,
           'mc error': 0.0050643509188001805,
           'mean': 4.7251490063439183},
 'alpha': {'95% HPD interval': array([ 0.38188356,  9.13365097]),
           'n': 122500,
           'quantiles': {2.5: 0.86469480018420042, 25: 2.6101111578571743,
                         50: 4.0243032728063861, 75: 5.7832094355613153,
                         97.5: 10.189414762984107},
           'standard deviation': 2.4364926610821929,
           'mc error': 0.011861474172469305,
           'mean': 4.4126960051489679}}
```

In the future: Generalized Linear Models ...

||x1 - x2|| = 999

||Ax1 - Ax2|| = 9.98999999835e-07

||x3 - x4|| = 0.1

||inv(A)x3 - inv(A)x4|| = 100000000.0

To overcome this problem, we

need to do something with matrix A:

regularization (we come back to this in the Maximum A Posteriori part)
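
The numbers above presumably come from an experiment of this kind; a minimal numpy sketch (the matrix `A` below is my own ill-conditioned example, so the exact values differ from the slide):

```python
import numpy as np

# an almost rank-deficient, ill-conditioned matrix: one tiny singular value
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-9]])

x1 = np.array([0.0, 0.0])
x2 = np.array([999.0, -999.0])                  # far apart ...
print(np.linalg.norm(x1 - x2))                  # large
print(np.linalg.norm(A @ x1 - A @ x2))          # ... but almost identical after A

x3 = np.array([1.0, 1.0])
x4 = np.array([1.0, 1.1])                       # close together ...
Ainv = np.linalg.inv(A)
print(np.linalg.norm(x3 - x4))                  # small
print(np.linalg.norm(Ainv @ x3 - Ainv @ x4))    # ... but blown up by inv(A)

# regularization: damp the inversion (Tikhonov / ridge)
lam = 1e-3
A_reg_inv = np.linalg.inv(A.T @ A + lam * np.eye(2)) @ A.T
print(np.linalg.norm(A_reg_inv @ x3 - A_reg_inv @ x4))  # stays moderate
```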

Maximum Entropy Distribution (Gibbs Distribution)

solution with constraints:

http://stat.columbia.edu/~porbanz/
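
For reference, the standard form of the constrained solution (notation added here): maximizing the entropy $H(p)$ subject to moment constraints $\mathbb{E}_p[\phi_i(x)] = c_i$ and normalization gives a Gibbs (exponential-family) distribution:

$$
p(x) = \frac{1}{Z(\lambda)} \exp\Big( \sum_i \lambda_i\, \phi_i(x) \Big),
\qquad
Z(\lambda) = \sum_x \exp\Big( \sum_i \lambda_i\, \phi_i(x) \Big)
$$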

The Crucial Bit

Jensen + latent variable
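
Sketched in the usual notation: for any distribution $q(z)$ over the latent variable,

$$
\log p(x \mid \theta)
= \log \sum_z q(z)\, \frac{p(x, z \mid \theta)}{q(z)}
\;\ge\; \sum_z q(z)\, \log \frac{p(x, z \mid \theta)}{q(z)}
= \mathcal{L}(q, \theta),
$$

by Jensen's inequality applied to the concave $\log$; $\mathcal{L}(q, \theta)$ is the lower bound that EM maximizes.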


Latent Dirichlet Allocation (LDA)

Graphical Models for Visual Object Recognition and Tracking (E. Sudderth)

Various graphical models

Standards (HMM)

**2003**

http://infolab.stanford.edu/~ullman/mmds/ch11.pdf

From coin to dice, from Beta to Dirichlet Distribution
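
The generalization in formulas (standard definitions):

$$
\mathrm{Beta}(p \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\; p^{\alpha-1}(1-p)^{\beta-1}
\quad\longrightarrow\quad
\mathrm{Dir}(\theta \mid \vec{\alpha}) = \frac{\Gamma\!\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}
$$

Just as the Beta is the conjugate prior for the Bernoulli/binomial (coin), the Dirichlet is the conjugate prior for the multinomial (dice).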

```
2013-12-19 13:36:46,157 : INFO : calculating IDF weights for 9 documents and 12 features (28 matrix non-zeros)
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
2013-12-19 13:36:46,159 : INFO : processed documents up to #9
2013-12-19 13:36:46,159 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2013-12-19 13:36:46,160 : INFO : topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"
```
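
The output above is gensim-style TF-IDF and LSI logging; a minimal sketch of how such numbers are produced (the tokenized corpus below is my assumption, presumably the standard 9-document gensim tutorial corpus, which matches the "9 documents and 12 features" in the log):

```python
import logging
from gensim import corpora, models

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# placeholder tokenized corpus (assumed, not from the original slides)
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system", "response", "time"],
         ["eps", "user", "interface", "system"],
         ["system", "human", "system", "eps"],
         ["user", "response", "time"],
         ["trees"],
         ["graph", "trees"],
         ["graph", "minors", "trees"],
         ["graph", "minors", "survey"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# TF-IDF weighting of the bag-of-words vectors
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

# LSI on top of TF-IDF, two latent topics
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(2)
```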

TF-IDF

LSI Topics

number of times term t_i appears in document j

number of all terms in document j

number of documents

number of documents where term i appears at least once
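
Putting these together in the usual way (the symbols are my own shorthand for the quantities above):

$$
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}},
\qquad
\mathrm{idf}_i = \log \frac{N}{\mathrm{df}_i},
\qquad
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i
$$

where $n_{i,j}$ is the count of term $t_i$ in document $j$, $N$ the number of documents, and $\mathrm{df}_i$ the number of documents containing term $i$.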

[Figure: LSI as matrix factorization: the words × documents matrix is approximated by (words × topics) × (topics × topics) × (topics × documents)]

http://d-nb.info/998079553/34

Latent Semantic Indexing

- Bayesian methods make assumptions where other methods don't

All methods make assumptions! Otherwise it's impossible to predict. Bayesian methods are transparent in their assumptions whereas other methods are often opaque.

- If you don't have the right prior you won't do well

Certainly a poor model will predict poorly but there is no such thing as the right prior! Your model (both prior and likelihood) should capture a reasonable range of possibilities. When in doubt you can choose vague priors (cf. nonparametrics).

- Maximum A Posteriori (MAP) is a Bayesian method

MAP is similar to regularization and offers no particular Bayesian advantages. The key ingredient in Bayesian methods is to average over your uncertain variables and parameters, rather than to optimize.

Myths and misconceptions about Bayesian methods

- Bayesian methods don't have theoretical guarantees

One can often apply frequentist style generalization error bounds to Bayesian methods (e.g. PAC-Bayes). Moreover, it is often possible to prove convergence, consistency and rates for Bayesian methods.

- Bayesian methods are generative

You can use Bayesian approaches for both generative and discriminative learning (e.g. Gaussian process classification).

- Bayesian methods don't scale well

With the right inference methods (variational, MCMC) it is possible to scale to very large datasets (e.g. excellent results for Bayesian Probabilistic Matrix Factorization on the Netflix dataset using MCMC), but it's true that averaging/integration is often more expensive than optimization.

http://cs229.stanford.edu/notes/cs229-notes8.pdf

EM in GMM

Kullback–Leibler divergence (E)

KL Exercise

and why?

Entropy (M)
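
The decomposition these labels refer to, in standard notation: for any $q(z)$,

$$
\log p(x \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big),
\qquad
\mathcal{L}(q, \theta) = \mathbb{E}_{q}\big[\log p(x, z \mid \theta)\big] + H(q)
$$

The E-step drives the KL term to zero; in the M-step only the expected complete-data log-likelihood depends on $\theta$, since the entropy term $H(q)$ is constant in $\theta$.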

Variants

Generalized EM

- Maximization of the lower bound is not needed (it only needs to increase)

Variational EM

- relaxes the requirement that q is the exact posterior (which is intractable, as in LDA)

MCMC EM

- similar in spirit to Variational EM, but the required expectation is approximated by sampling rather than by a variational distribution

Beta Function Generalization

LDA Details

Likelihood of document:

Likelihood of word:

Likelihood of corpus:
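
Written out as in Blei et al. (2003), with topic proportions $\theta$, topic assignments $z_n$, topics $\beta$ and Dirichlet parameter $\alpha$:

$$
p(w_n \mid \theta, \beta) = \sum_{z_n} p(w_n \mid z_n, \beta)\, p(z_n \mid \theta)
$$

$$
p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\; d\theta,
\qquad
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
$$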

From Beta to Delta
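
Following the notation of Heinrich's Gibbs-LDA notes (linked below), the Beta function generalizes to a "Delta" function over $K$ dimensions:

$$
B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}
\quad\longrightarrow\quad
\Delta(\vec{\alpha}) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\!\big(\sum_{k=1}^{K} \alpha_k\big)},
\qquad
\int \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\, d\theta = \Delta(\vec{\alpha})
$$

(the integral is over the probability simplex; $\Delta(\vec{\alpha})$ is the Dirichlet normalizer.)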

LDA Details 2

LDA Details 3

LDA Details 4

src: http://156.17.130.153/~agonczarek/Lectures/lect2gm.pdf
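
For the collapsed Gibbs sampler mentioned earlier, the standard full conditional (derived in Heinrich's notes linked below; $n^{(t)}_{k,\neg i}$ counts how often term $t$ is assigned to topic $k$ excluding position $i$, $n^{(k)}_{m,\neg i}$ counts topic $k$ in document $m$, $V$ is the vocabulary size):

$$
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\;
\frac{n^{(t)}_{k,\neg i} + \beta}{\sum_{t'} n^{(t')}_{k,\neg i} + V\beta}\;
\big(n^{(k)}_{m,\neg i} + \alpha\big)
$$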

https://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf

http://www.cs.cmu.edu/~chongw/papers/WangBleiFeiFei2009.pdf

http://faculty.cs.byu.edu/~ringger/CS601R/papers/Heinrich-GibbsLDA.pdf

http://stat.columbia.edu/~porbanz/teaching/W4400/slides_final.pdf

http://www.cs.toronto.edu/~hinton/csc2515/notes/lec6tutorial.pdf

http://156.17.130.153/~agonczarek/Lectures/lect1bayes.pdf

http://www.cs.cmu.edu/~nasmith/LS2/gimpel.06.pdf

http://yosinski.com/mlss12/media/slides/MLSS-2012-Blei-Probabilistic-Topic-Models.pdf

http://www.cs.columbia.edu/~jebara/4771/tutorials/lecture12.pdf

http://cs229.stanford.edu/notes/cs229-notes8.pdf

Jensen's inequality!

Latent variable z

Why is the E-step a posterior?
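
A one-line sketch of the answer: the lower bound can be rewritten as

$$
\mathcal{L}(q, \theta) = \log p(x \mid \theta) - \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big),
$$

so for fixed $\theta$ it is maximized exactly when the KL term vanishes, i.e. when $q(z) = p(z \mid x, \theta)$, the posterior over the latent variable.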

http://pages.cs.wisc.edu/~andrzeje/research/em.pdf

http://www.cs.toronto.edu/~radford/res-bayes-ex.html