
# Statistical Framework

by

## Grzegorz Gwardys

on 29 December 2013


#### Transcript of Statistical Framework

Before ...
EM not enough ...
Goal: approximate the full distribution
But why ?
Statistical Framework
What the ...
Bayesian Approach
Come back to LDA
Hierarchical Process
Conjugate priors
Exponential families
Latent variables
Collapsed Gibbs Sampling
MCMC
variational approximation
That's the purpose of LDA
Supervised Topic Models
Supervised LDA
Motivation?
What is the probability of heads, given 3 heads
in 10 throws?
(maximum likelihood assumed)

H T T T H T H T T T
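The maximum-likelihood answer can be sketched in a few lines: for a Bernoulli coin, the likelihood L(p) = p^h (1-p)^t is maximized at the sample proportion, so with the throw sequence shown above the estimate is simply 3/10.

```python
# Maximum-likelihood estimate of P(heads) for the throw sequence above:
# L(p) = p^h * (1-p)^t is maximized at the sample proportion h / n.
throws = "H T T T H T H T T T".split()

heads = throws.count("H")
p_hat = heads / len(throws)  # MLE: fraction of heads

print(p_hat)  # 0.3
```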

Maximum Likelihood
Least Squares and ML ?
easy
interpretable
asymptotic properties
invariant under reparametrization
point estimate
reparametrization
Latent Variables
Variational
Methods
Markov Chain Monte Carlo
Gibbs Sampling
Variational Methods
Bayesian Approach
A Bayesian approach to a problem starts with the formulation of a model that we hope is adequate to describe the situation of interest. We then formulate a prior distribution over the unknown parameters of the model, which is meant to capture our beliefs about the situation before seeing the data. After observing some data, we apply Bayes' Rule to obtain a posterior distribution for these unknowns, which takes account of both the prior and the data. From this posterior distribution we can compute predictive distributions for future observations.
But before that ...
Maximum A Posteriori
Posterior Distribution
BTW, ML is a degenerate case of MAP (with a uniform prior)
Predictive Distribution
Beta Distribution
Conjugate Priors !
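As a sketch of why the Beta prior is convenient here: for Bernoulli data, updating a Beta(a, b) prior reduces to adding counts. Using the 3-heads-in-10-throws data from the coin example and a uniform Beta(1, 1) prior (both assumptions carried over from above):

```python
# Beta-Bernoulli conjugacy: prior Beta(a, b) plus h heads and t tails
# gives posterior Beta(a + h, b + t) -- no integration needed.
a, b = 1.0, 1.0   # uniform prior Beta(1, 1)
h, t = 3, 7       # observed heads and tails

a_post, b_post = a + h, b + t                    # posterior Beta(4, 8)
post_mean = a_post / (a_post + b_post)           # posterior mean = 1/3
map_est = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)

# With a uniform prior, MAP coincides with the ML estimate:
print(map_est)  # 0.3
```

Note how the posterior mean (1/3) is pulled slightly toward the prior, while the MAP estimate under the uniform prior reproduces the ML answer, illustrating the "ML is a degenerate case of MAP" remark above.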
```python
import pymc as mc
from gen_data_bern import getData

throws = getData()
p = mc.Uniform('p', lower=0, upper=1)
obs = mc.Bernoulli("obs", p, value=throws, observed=True)
```

```python
import pymc as mc
from gen_data_bern import getData

throws = getData()
alpha = mc.Exponential('alpha', beta=0.5)
beta = mc.Normal('beta', mu=5, tau=1)
p = mc.Beta('p', alpha=alpha, beta=beta)
obs = mc.Bernoulli("obs", p, value=throws, observed=True)
```
```python
import pymc as mc
import numpy as np
import matplotlib.pyplot as plt
import model_binomial

model = model_binomial
iterations = 25000
model = mc.MCMC(model, db='pickle')
model.sample(iter=iterations, burn=5000, thin=2)
print(model.stats())

# saving graph
mc.graph.graph(model)
mc.Matplot.plot(model)
plt.show()
```

```
{'p': {'95% HPD interval': array([ 0.5933474 ,  0.64185615]),
       'n': 122500,
       'quantiles': {2.5: 0.59344379128091884, 25: 0.60948935734400123,
                     50: 0.61793721629785492, 75: 0.62633076257775722,
                     97.5: 0.64201057731255884},
       'standard deviation': 0.012458537921094534,
       'mc error': 6.7276335499864257e-05,
       'mean': 0.61787796180912014},
 'beta': {'95% HPD interval': array([ 2.73917959,  6.60336035]),
       'n': 122500,
       'quantiles': {2.5: 2.7878098015030157, 25: 4.0646363496998505,
                     50: 4.7246670506308996, 75: 5.3858697946343597,
                     97.5: 6.6563463874649784},
       'standard deviation': 0.98090170439034674,
       'mc error': 0.0050643509188001805,
       'mean': 4.7251490063439183},
 'alpha': {'95% HPD interval': array([ 0.38188356,  9.13365097]),
       'n': 122500,
       'quantiles': {2.5: 0.86469480018420042, 25: 2.6101111578571743,
                     50: 4.0243032728063861, 75: 5.7832094355613153,
                     97.5: 10.189414762984107},
       'standard deviation': 2.4364926610821929,
       'mc error': 0.011861474172469305,
       'mean': 4.4126960051489679}}
```

In the future: Generalized Linear Models ...
```
||x1 - x2|| = 999
||Ax1 - Ax2|| = 9.98999999835e-07
||x3 - x4|| = 0.1
||inv(A)x3 - inv(A)x4|| = 100000000.0
```
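The matrix A behind these numbers is not shown in the slides, but an ill-conditioned diagonal matrix (condition number 1e9) reproduces the effect, as a sketch: A can collapse a large input difference to nearly nothing, while inv(A) blows up a small one.

```python
import numpy as np

# Hypothetical ill-conditioned matrix: condition number 1e9.
# The original slide's A is not given; this diagonal choice just
# reproduces the same order-of-magnitude behaviour.
A = np.diag([1e-9, 1.0])

x1, x2 = np.array([999.0, 0.0]), np.array([0.0, 0.0])
x3, x4 = np.array([0.1, 0.0]), np.array([0.0, 0.0])

print(np.linalg.norm(x1 - x2))                       # 999.0
print(np.linalg.norm(A @ (x1 - x2)))                 # ~9.99e-07
print(np.linalg.norm(x3 - x4))                       # 0.1
print(np.linalg.norm(np.linalg.inv(A) @ (x3 - x4)))  # ~1e+08
```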

To overcome this problem, we need to do something with matrix A:
regularization (we come back to this in the Maximum A Posteriori part)
Maximum Entropy Distribution (Gibbs distribution): the solution with constraints:

http://stat.columbia.edu/~porbanz/
The Crucial Bit
Jensen + latent variable

Latent Dirichlet Allocation (LDA)
Graphical Models for Visual Object Recognition and Tracking (E. Sudderth)
Various graphical models
Standards (HMM)
2003
http://infolab.stanford.edu/~ullman/mmds/ch11.pdf
From coin to dice, from Beta to Dirichlet Distribution
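The "coin to dice" step can be sketched numerically: the Beta distribution is a prior over one coin bias, and the Dirichlet generalizes it to a probability vector over K outcomes, e.g. the six faces of a die (the concentration parameters below are illustrative, not from the slides).

```python
import numpy as np

# Beta: a draw is one probability, P(heads) for a coin.
# Dirichlet: a draw is a whole probability vector over K outcomes (a die).
rng = np.random.default_rng(0)

coin_bias = rng.beta(2.0, 2.0)        # Beta(2, 2): single probability in [0, 1]
die_probs = rng.dirichlet([2.0] * 6)  # symmetric Dirichlet, K = 6 faces

print(coin_bias)
print(die_probs)
print(die_probs.sum())  # ~1.0: a valid probability vector over the faces
```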
```
2013-12-19 13:36:46,157 : INFO : calculating IDF weights for 9 documents and 12 features (28 matrix non-zeros)
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]

2013-12-19 13:36:46,159 : INFO : processed documents up to #9
2013-12-19 13:36:46,159 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2013-12-19 13:36:46,160 : INFO : topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"
```

TF-IDF
LSI Topics
number of times term t_i appears in document j
number of all terms in document j
number of documents
number of documents where term i appears at least once
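These four quantities are everything TF-IDF needs; a minimal sketch over toy documents (the documents and the log base are assumptions for illustration, not the corpus from the gensim log):

```python
import math

# tf-idf from the definitions above:
#   tf  = count of term in document j / number of terms in j
#   idf = log(N / df), where df = number of documents containing the term
docs = [["human", "interface", "computer"],
        ["survey", "user", "computer", "system", "response", "time"],
        ["eps", "user", "interface", "system"]]

N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(N / df)

print(tf_idf("human", docs[0]))  # rare term: appears in 1 of 3 docs
print(tf_idf("user", docs[1]))   # more common term: lower idf
```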
documents
words
topics
words
topics
topics
documents
topics
http://d-nb.info/998079553/34
Latent Semantic Indexing
- Bayesian methods make assumptions where other methods don't
All methods make assumptions! Otherwise it's impossible to predict. Bayesian methods are transparent in their assumptions whereas other methods are often opaque.

- If you don't have the right prior you won't do well
Certainly a poor model will predict poorly, but there is no such thing as
the right prior! Your model (both prior and likelihood) should capture a
reasonable range of possibilities. When in doubt you can choose vague priors (cf. nonparametrics).

- Maximum A Posteriori (MAP) is a Bayesian method
MAP is similar to regularization and offers no particular Bayesian advantages.
The key ingredient in Bayesian methods is to average over your uncertain
variables and parameters, rather than to optimize.
Myths and misconceptions about Bayesian methods


- Bayesian methods don't have theoretical guarantees
One can often apply frequentist-style generalization error bounds to Bayesian methods (e.g. PAC-Bayes). Moreover, it is often possible to prove convergence, consistency, and rates for Bayesian methods.

- Bayesian methods are generative
You can use Bayesian approaches for both generative and discriminative
learning (e.g. Gaussian process classification).

- Bayesian methods don't scale well
With the right inference methods (variational, MCMC) it is possible to
scale to very large datasets (e.g. excellent results for Bayesian Probabilistic Matrix Factorization on the Netflix dataset using MCMC), but it's true that averaging/integration is often more expensive than optimization.
http://cs229.stanford.edu/notes/cs229-notes8.pdf
EM in GMM
Kullback–Leibler divergence (E)
KL Exercise
and why ?
Entropy (M)
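As a sketch for the KL exercise: KL(p || q) is non-negative and zero only when p = q, which is exactly the property the EM bound relies on; entropy is the expected negative log-probability (definitions are standard, the example distributions are made up).

```python
import math

# KL divergence and entropy for discrete distributions.
# KL(p || q) >= 0, with equality iff p == q -- this is what makes
# the EM lower bound tight at the posterior.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]
q = [1/3, 1/3, 1/3]

print(kl(p, p))     # 0.0
print(kl(p, q))     # positive
print(entropy(q))   # log(3): uniform distribution maximizes entropy
```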
Variants
Generalized EM
- full maximization of the lower bound is not needed (it only has to increase)
Variational EM
- relaxes the requirement that q is the exact posterior (which is intractable, as in LDA)
MCMC EM
- similar in spirit to variational EM, but approximates the E-step by sampling instead
Beta Function Generalization
LDA Details
Likelihood of document:
Likelihood of word:
Likelihood of corpus:
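The three likelihoods named above can be written out as a sketch in the notation of Blei, Ng & Jordan (2003), with topic proportions θ drawn from Dirichlet(α), per-word topic assignments z_n, and topic-word parameters β:

```latex
% word, given the document's topic proportions \theta:
p(w_n \mid \theta, \beta) = \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% document: integrate out \theta and sum out each z_n:
p(\mathbf{w} \mid \alpha, \beta) =
  \int p(\theta \mid \alpha)
  \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% corpus: documents are independent given \alpha and \beta:
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
```

The integral over θ coupled with the sum over z is exactly what makes exact inference intractable and motivates variational approximation or Gibbs sampling.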
From Beta to Delta
LDA Details 2
LDA Details 3
LDA Details 4
Sources:
http://156.17.130.153/~agonczarek/Lectures/lect1bayes.pdf
http://156.17.130.153/~agonczarek/Lectures/lect2gm.pdf
https://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf
http://www.cs.cmu.edu/~chongw/papers/WangBleiFeiFei2009.pdf
http://faculty.cs.byu.edu/~ringger/CS601R/papers/Heinrich-GibbsLDA.pdf
http://stat.columbia.edu/~porbanz/teaching/W4400/slides_final.pdf
http://www.cs.toronto.edu/~hinton/csc2515/notes/lec6tutorial.pdf
http://www.cs.cmu.edu/~nasmith/LS2/gimpel.06.pdf
http://yosinski.com/mlss12/media/slides/MLSS-2012-Blei-Probabilistic-Topic-Models.pdf
http://www.cs.columbia.edu/~jebara/4771/tutorials/lecture12.pdf
http://cs229.stanford.edu/notes/cs229-notes8.pdf
Jensen's inequality !
Latent variable z
Why is E-Step a
posterior?
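The Jensen step, and the answer to why the E-step is a posterior, can be sketched as follows: for any distribution q(z) over the latent variable,

```latex
\ln p(x \mid \theta)
  = \ln \sum_{z} q(z)\, \frac{p(x, z \mid \theta)}{q(z)}
  \;\ge\; \sum_{z} q(z) \ln \frac{p(x, z \mid \theta)}{q(z)}
  =: \mathcal{L}(q, \theta)
  \quad \text{(Jensen's inequality)}

\ln p(x \mid \theta) - \mathcal{L}(q, \theta)
  = \mathrm{KL}\!\left( q(z) \,\middle\|\, p(z \mid x, \theta) \right) \;\ge\; 0
```

The gap is a KL divergence, which vanishes exactly when q(z) = p(z | x, θ): the E-step sets q to the posterior because that makes the lower bound tight at the current θ.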
http://pages.cs.wisc.edu/~andrzeje/research/em.pdf