Seminar Chordalysis

June 2015
by François Petitjean


Transcript of Seminar Chordalysis

Chordalysis
Motivation
We need to know why!
Take-home message
We have very effective approaches for making predictions from data
Spam detection
Credit card fraud detection
Weather prediction
Cancer prediction
How good are we at understanding the why of our prediction systems?
"We predict that you are going to get cancer in 4 years. What you can do is..."
Temperature on Earth is likely to increase by 3°C, because...

Data will be even more valuable when we know how to take actions that will make a difference.
Scaling log-linear analysis to datasets with 1,000+ variables
Prioritized Chordalysis
Prioritized Chordalysis:
1. can analyse data with 1,000+ variables
2. does not sacrifice statistical soundness
3. is released at https://github.com/fpetitjean/

http://www.francois-petitjean.com
francois.petitjean@monash.edu
@LeDataMiner
...a very simple example:
"Is voting for the presidential election independent of being a member of an association?"
Stats
Log-linear analysis
The null model: Vote and Member are independent
How to compute E? E = the expected cell counts under the null model
The hypothesis: Vote and Member are associated
Let's do the maths
For multinomial sampling, the null model will be rejected when the test statistic gets large...
... i.e. when its p-value falls below the chosen significance level.
Here the statistic is large (the p-value is below the threshold), so we reject the null hypothesis.
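To make this concrete, here is a minimal sketch of the G² (likelihood-ratio) test of independence. Only the Vote margin (987 voters out of 1,473 respondents) comes from the slides; the split by association membership below is invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = Vote (yes/no), columns = Member (yes/no).
# Only the row margin (987 voted, 486 did not, 1,473 in total) matches the slides.
observed = np.array([[620, 367],
                     [230, 256]])

# lambda_="log-likelihood" requests the G^2 (likelihood-ratio) statistic
# instead of Pearson's X^2.
g2, p_value, dof, expected = chi2_contingency(observed, lambda_="log-likelihood")

print(f"G^2 = {g2:.2f}, dof = {dof}, p-value = {p_value:.3g}")
print("Expected counts under independence:\n", expected.round(1))
if p_value < 0.05:
    print("Reject the null model: Vote and Member appear dependent.")
```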
1. Formulate hypotheses (is Vote independent of Member?)
2. Get data
3. Evaluation
Information theory
The two candidate models: Vote and Member independent, or Vote and Member associated
Message length = Length(Model) + Length(Data given the Model)
How it works
2 things have to be sent:
1. Which model it is
2. The parameters of the model
There are 2 possible models => 1 bit
François Petitjean, Ann Nicholson and Geoff Webb

e.g. send the denominator 1,473 (all models will have it), then send 987
Independence model: model indicator => 1 bit; 2 parameters => 18.9 bits; total = 19.9 bits
Model with the Vote-Member edge: model indicator => 1 bit; 3 parameters => 29.8 bits; total = 30.9 bits
If you know that 987 people voted, how many bits does it take to reconstitute the data regarding Voted?
This is an ordering problem: the positions {1, 2, 4, 5, 8, ...} of the voters among the 1,473 respondents
There are C(1473, 987) possible orderings; sending any one of them takes log2 C(1473, 987) bits
Data cost: 2,742 bits for the independence model
Data cost: 2,690 bits for the model with the Vote-Member edge
Totals: independence model = 2,762 bits; model with the Vote-Member edge = 2,720 bits
The model with the Vote-Member edge is therefore more probable than the independence model
Disclaimer: it's actually a bit more complicated than that, see our ICDM14 paper - Section III.A
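As a rough illustration of the data-encoding part above (this is only a sketch: the exact parameter-coding scheme of the ICDM 2014 paper is not reproduced here, so the numbers will not match the slides' totals exactly), the cost of the ordering problem can be computed directly:

```python
from math import comb, log2

n = 1473   # total number of respondents (the shared denominator)
k = 987    # number of respondents who voted

# Naive cost of transmitting the count itself (one parameter): any value
# in 0..n is possible, so at most log2(n + 1) bits.
param_bits = log2(n + 1)

# Given the count k, the Voted column is one particular choice of k positions
# among n: there are C(n, k) of them, so pointing at the right one costs
# log2 C(n, k) bits.
data_bits = log2(comb(n, k))

print(f"naive parameter cost : {param_bits:.1f} bits")
print(f"data cost given count: {data_bits:.1f} bits")
```

The model with the smallest total message length (model indicator + parameters + data) is the one preferred, exactly as in the Vote/Member comparison above.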
Generalizing
What we have shown so far:
Scalability to datasets with 100 variables is possible for the class of decomposable models
ICDM 2013: statistical tests
ICDM 2014: minimum message length
What this paper shows: we can gain 4 orders of magnitude while getting exactly the same results.

Scoring
Assessing the addition of one edge to this model? We only need to consider 4 cliques.
Solution: 42 bits difference
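Under the log-likelihood / KL-divergence scoring (one of the scorings discussed below), the gain from adding edge {a,b} whose minimal separator is S reduces to N times the empirical conditional mutual information I(a; b | S), which only needs the marginal counts over S, S∪{a}, S∪{b} and S∪{a,b}. A minimal sketch, assuming the data sits in a pandas DataFrame of categorical columns (this is not the Chordalysis code, which additionally applies a significance test or MML penalty):

```python
import numpy as np
import pandas as pd

def entropy_bits(df, cols):
    """Empirical joint entropy (in bits) of the given columns."""
    p = df.groupby(cols).size() / len(df)
    return float(-(p * np.log2(p)).sum())

def edge_gain_bits(df, a, b, sep):
    """Log-likelihood gain (in bits) of adding edge {a, b} whose minimal
    separator in the current chordal model is `sep`:
    N * I(a ; b | sep), computed from four local marginals only."""
    h_sep = entropy_bits(df, sep) if sep else 0.0
    return len(df) * (entropy_bits(df, sep + [a]) + entropy_bits(df, sep + [b])
                      - entropy_bits(df, sep + [a, b]) - h_sep)

# Usage (hypothetical column names): edge_gain_bits(df, "a", "b", ["c", "d"])
```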
Log-linear analysis
Learning graphical models from data
1. Learn structure from data
2. Use the structure
Belief Propagation
Alice has difficulty walking.
She had a heart attack 10 years ago.
She takes medication for her high blood pressure.
The doctor suspects Type 2 diabetes.
p(Diabetes|...) = 0.01
[Figure: Bayesian network with the patient's evidence entered on its nodes - Female, 60+, 60-65kg, 170+, various yes/no values, some nodes unobserved (?)]
Diabetes is not likely, so the doctor decides to change her medication for high blood pressure.
Studying correlations & independencies
In the data: the systolic blood pressure is independent of the weight, given gender, diastolic blood pressure and medication for high blood pressure.
A change in weight alone is therefore not likely to change the systolic blood pressure.
Information theory
Limitation: the process is quadratic in the number of variables
1,000+ variables => several days of computation
1,000+ variables => 1 minute of computation (with Prioritized Chordalysis)
Scoring decomposable models... only relies upon local structures
Scoring a model against the full data: 32.5
In our case, we only need...
Scoring the addition of edge {a,b} to a model, using only the local data over {a,b}: 12.2
Scoring the addition of an edge to a model:
the minimal separator of (a,b) = the minimal set of vertices that would disconnect a from b if removed from the graph; here it is {c,d}.
This has been proven for different scorings:
Statistical tests [1]
MML/MDL [2]
KL divergence / log-likelihood [3]

[1]: F. Petitjean et al., "Scaling log-linear analysis to high-dimensional data," in ICDM 2013.
[2]: F. Petitjean et al., "A statistically efficient and scalable method for log-linear analysis of high-dimensional data," in ICDM 2014.
[3]: A. Deshpande et al., "Efficient stepwise selection in decomposable models," in UAI 2001.

This means that what we are about to show stands for all these scorings.
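To make the minimal-separator definition concrete, here is a small sketch using networkx (which is not what Chordalysis uses internally; the actual implementation maintains separators incrementally through the clique graph):

```python
import networkx as nx

# Small chordal graph in which {c, d} separates a from b.
G = nx.Graph([("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")])

# A minimum vertex cut between a and b is a smallest set of vertices whose
# removal disconnects them; here it is the minimal separator {'c', 'd'}.
sep = nx.minimum_node_cut(G, "a", "b")
print(sep)                       # {'c', 'd'}

# Sanity check: removing the separator really disconnects a from b.
H = G.copy()
H.remove_nodes_from(sep)
print(nx.has_path(H, "a", "b"))  # False
```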
What we have seen so far:
Evaluating the addition of an edge only depends upon 4 cliques of the graph.
How often does that happen?
Our intuition: not all edges should be re-examined at every step of the process.
Select edge {g,h}: Score({a,b}) did not change, so the addition of edge {a,b} need not be re-examined in the new model.
How can we use this information?
We know: if the minimal separator of (a,b) does not change between different modifications of the graph, then the addition of {a,b} need not be re-examined.
1. Use a data structure that gives direct access to minimal separators for every potential edge

2. Keep track of the minimal separators for every potential edge

3. Maintain an ordered list of all the potential edges (priority queue)
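A minimal sketch of point 3, assuming placeholder functions current_separator_of and rescore (this is not the actual Chordalysis code, and it re-scores lazily at pop time, whereas the paper updates affected edges as soon as the clique graph changes):

```python
import heapq

# Each entry is (-score, edge, separator_at_scoring_time); heapq is a min-heap,
# so scores are negated to pop the best-scoring candidate edge first.
pq = []

def push(edge, separator, score):
    heapq.heappush(pq, (-score, edge, frozenset(separator)))

def pop_best(current_separator_of, rescore):
    """Pop candidate edges until one whose separator is still up to date is found;
    stale entries are re-scored with their current separator and pushed back."""
    while pq:
        neg_score, edge, sep_at_scoring = heapq.heappop(pq)
        sep_now = frozenset(current_separator_of(edge))
        if sep_now == sep_at_scoring:
            return edge, -neg_score                   # score is still valid
        push(edge, sep_now, rescore(edge, sep_now))   # separator changed: re-examine
    return None
```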
Clique graph
BUT those algorithms cannot track the minimal separators.
We make this possible - details in the paper.
[1]: A. Deshpande et al., "Efficient stepwise selection in decomposable models," in UAI 2001.
Keeping track of the minimal separators
We store an enriched association matrix. For each candidate pair, e.g. {c,e}, we store whether {c,e} is eligible for addition and the associated clique-graph edge.
We modify the original algorithm to maintain this matrix.
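A minimal sketch of one entry of such a matrix (the field names and example values are illustrative, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class CandidateEdge:
    eligible: bool = False             # can this pair be added while keeping the graph chordal?
    separator: frozenset = frozenset() # current minimal separator of the pair
    clique_graph_edge: tuple = None    # the clique-graph edge labelled by that separator

# One entry per unordered pair of variables, e.g. matrix[frozenset({"c", "e"})].
matrix = {}
matrix[frozenset({"c", "e"})] = CandidateEdge(
    eligible=True,
    separator=frozenset({"b"}),
    clique_graph_edge=(frozenset({"b", "c"}), frozenset({"b", "e"})),
)
```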
Priority queue
[Figure: the priority queue of candidate edges, e.g. b-e, b-c, ...]
Add b-e
Add a-b; update a-e
Add e-f; update b-f; disable a-f
Add b-c; update a-c; update c-e; disable c-f
Add a-c
Add c-e; update a-e; enable c-f
Running times
[Figure: running times as a function of the number of variables (20 to 2,000), including the NYT dataset - 180k instances]
Intuition
add edge {f,g}
There are algorithms that can directly update the clique-graph [1]
Log-linear analysis = one of THE standard methods in statistics
Scaling log-linear analysis to datasets with 1,000+ variables
Thanks for your attention!
Log-linear analysis
General framework: selecting a statistically significant log-linear model = superset of Markov Networks
Why doesn't it scale up?
Is the better fit worth the increase in complexity?
Complexity vs. fit
Exponential with the number of variables:
1,000 binary variables => 10^300 operations - more than the number of atoms in the observable universe...
We want to do math!
1,000 binary variables => 10^300 operations - more than the number of atoms in the observable universe...
Known from data
How to scale up this evaluation while keeping its statistical soundness?
Requires fitting the model first: iterative optimisation
Models with a closed form
[Figure: three example graphs over variables A, B, C]
Closed form iff the graph is chordal (triangulated)
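Chordality is cheap to check; for instance with networkx (not part of Chordalysis itself):

```python
import networkx as nx

# A 4-cycle A-B-C-D-A is not chordal...
square = nx.cycle_graph(["A", "B", "C", "D"])
print(nx.is_chordal(square))   # False

# ...but adding the chord A-C triangulates it.
square.add_edge("A", "C")
print(nx.is_chordal(square))   # True
```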
Why are decomposable models so interesting?
Properties:
1. Closed form
2. Not a big restriction: there is always a decomposable model that subsumes a non-decomposable model
3. The MLE always exists [Agresti, 2002]
4. Intersection between BN and MRF [Koller, 2009]
Chordalysis
1,000 binary variables => 10^300 operations - more than the number of atoms in the observable universe...
Chordalysis: log-linear analysis for decomposable models
Result #2: comparing models that differ by one edge
Result #1: proved
Chordalysis: what is missing to scale up to 1,000+ variables?
In our ICDM 2013 paper:
1. Elicitation of the edges that will keep the graph chordal
2. Efficient computation of maximal cliques and minimal separators
3. Multiple testing correction
4. Efficient computation of marginal frequencies
5. Efficient computation and retrieval of marginal entropies
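For point 2 in the list above, the maximal cliques (and, via a junction tree, the minimal separators) of a chordal graph can be obtained cheaply; a small illustration with networkx (Chordalysis maintains them incrementally instead):

```python
import networkx as nx

# Chordal example graph over A, B, C, D: the 4-cycle A-B-D-C-A plus the chord B-C.
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")])

# Maximal cliques of a chordal graph; intersections of adjacent cliques along
# a junction tree are the minimal separators (here {B, C}).
print(list(nx.chordal_graph_cliques(G)))   # e.g. [frozenset({'A', 'B', 'C'}), frozenset({'B', 'C', 'D'})]
```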
[Figure: example graph over variables A, B, C, D]
Controlled environment
We need a controlled environment... because in reality, we never have the "ground truth".
Is Chordalysis good? Bad? ... hard to tell for sure...
Experimental methodology
1. choose a structure
2. parametrize the model
3. sample data
4. learn structure (with Chordalysis)
5. compare structures
6. output stats
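Steps 5 and 6 amount to comparing the learned edge set with the ground-truth one; a minimal sketch (the structure learner itself, i.e. Chordalysis, is assumed to be available and is not shown here):

```python
def compare_structures(true_edges, learned_edges):
    """Compare two sets of undirected edges and return simple quality statistics."""
    true_edges = {frozenset(e) for e in true_edges}
    learned_edges = {frozenset(e) for e in learned_edges}
    tp = len(true_edges & learned_edges)      # correctly recovered edges
    fp = len(learned_edges - true_edges)      # spurious edges
    fn = len(true_edges - learned_edges)      # missed edges
    precision = tp / len(learned_edges) if learned_edges else 1.0
    recall = tp / len(true_edges) if true_edges else 1.0
    return {"precision": precision, "recall": recall, "#mistakes": fp + fn}

# Hypothetical example: ground truth vs. a learned structure.
truth   = [("a", "b"), ("b", "c"), ("c", "d")]
learned = [("a", "b"), ("b", "c"), ("b", "d")]
print(compare_structures(truth, learned))
# precision = recall ≈ 0.67, #mistakes = 2
```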
Results
[Figures: precision, statistical efficiency (#mistakes), and runtime]
Real-world examples
Study of the elderly: 25 variables, 15,000 patients; computation: 2s
Customer management: 85 variables, 6,000 clients; computation: 2s
[Timeline: from 10-100 variables (2007) to 100-1,000+ variables: ICDM-13, ICDM-14, SDM-15, KDD-16]