**Chordalysis**

**Motivation**

We need to know why!

**Take-home message**

We have very effective approaches for making predictions from data

Spam detection

Credit card fraud detection

Weather prediction

Cancer prediction

How good are we at understanding the why of our prediction systems?

"We predicted you that you are going to get

cancer

in 4 years.

What you can do is...

"

Temperature on Earth is likely to increase by 3°C, because...

Data will be even more valuable when we know how to take actions that will make a difference.

**Scaling log-linear analysis to datasets with 1,000+ variables**

**Prioritized Chordalysis**

1. can analyse data with 1,000+ variables
2. does not sacrifice soundness
3. is released on https://github.com/fpetitjean/

http://www.francois-petitjean.com

francois.petitjean@monash.edu

@LeDataMiner

...a very simple example

"Is voting for the presidential election independent of being a member of an association?"

Stats: log-linear analysis

The null model: Vote and Member are independent.

How to compute E? The expectation under the hypothesis: the expected counts for the Vote x Member table.

Let's do the maths.

For multinomial sampling, the null model will be rejected when $G^2 = 2\sum_{i,j} O_{ij} \log\frac{O_{ij}}{E_{ij}}$ gets large... i.e. when the corresponding p-value falls below the significance level. We have a p-value below that level here: we reject the null hypothesis.
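As a minimal sketch, here is how this test could be computed on a 2x2 Vote x Member table. Only the totals 1,473 and 987 come from the talk; the individual cell counts below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

# Illustrative 2x2 contingency table: rows = Vote (yes/no), columns = Member (yes/no).
# Only the grand total (1,473 people) and the number of voters (987) mirror the talk.
observed = np.array([[550., 437.],
                     [150., 336.]])

# Expected counts under the null (independence) model: E_ij = row_i * col_j / N
N = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / N

# Likelihood-ratio statistic G^2, with df = (rows-1) * (cols-1) = 1 for a 2x2 table
g2 = 2.0 * np.sum(observed * np.log(observed / expected))
p_value = chi2.sf(g2, df=1)
print(f"G^2 = {g2:.1f}, p = {p_value:.3g}")
# Reject the null (independence) model when p falls below the chosen significance level.
```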

1. Formulate hypotheses: is Vote independent of Member?
2. Get data
3. Evaluation

Information theory

[The two candidate models over Vote and Member; each message = length(Model) + length(Data | Model)]

How it works

Two things have to be sent:

1. Which model it is
2. The parameters of the model

There are 2 possible models => 1 bit

**François Petitjean**, Ann Nicholson and Geoff Webb

e.g. send the denominator 1473 first (all models will have it), then send 987.

Independence model: Model => 1 bit; 2 parameters => 18.9 bits; Total = 19.9 bits
Dependence model: Model => 1 bit; 3 parameters => 29.8 bits; Total = 30.9 bits

If you know that 987 people voted, how many bits does it take to reconstitute the data regarding Voted?

This is an ordering problem: send the positions {1, 2, 4, 5, 8, ...} of the people who voted.

There are $\binom{1473}{987}$ possible orderings; sending any one of them takes $\log_2\binom{1473}{987}$ bits.

Total = 2,742 bits (data under the independence model)
Total = 2,690 bits (data under the dependence model)

Overall: 2,762 bits for the independence model (Vote, Member, no edge) vs 2,720 bits for the dependence model (Vote connected to Member).

The dependence model is more probable than the independence model.

Disclaimer: it's actually a bit more complicated than that, see our ICDM14 paper - Section III.A

So the answer is: No, voting is not independent of being a member of an association.
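A rough sketch of the data-cost part of this comparison, reusing the illustrative table from above (so the bit counts will not reproduce the talk's 2,742 / 2,690 figures, and the parameter costs are omitted):

```python
from math import comb, log2

# Same illustrative 2x2 table: rows = Vote (yes/no), columns = Member (yes/no)
observed = [[550, 437],
            [150, 336]]
N = sum(map(sum, observed))                 # 1,473 people
vote_yes = sum(observed[0])                 # 987 voted

def ordering_bits(n, k):
    """Bits needed to say which k of n positions are 'yes' (the ordering problem)."""
    return log2(comb(n, k))

# Data cost under the independence model: each column is encoded on its own.
member_yes = observed[0][0] + observed[1][0]
indep_data = ordering_bits(N, vote_yes) + ordering_bits(N, member_yes)

# Data cost under the dependence model: Member is encoded separately
# within the voters and within the non-voters.
dep_data = (ordering_bits(N, vote_yes)
            + ordering_bits(vote_yes, observed[0][0])
            + ordering_bits(N - vote_yes, observed[1][0]))

print(f"data bits, independence model: {indep_data:.1f}")
print(f"data bits, dependence model:   {dep_data:.1f}")
# Adding the 1 model bit and the parameter costs gives the full message lengths
# that the slides compare (2,762 vs 2,720 bits there).
```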

**Generalizing**

What we have shown so far: scalability to datasets with 100 variables is possible for the class of decomposable models.

ICDM 2013: statistical tests
ICDM 2014: minimum message length

What this paper shows: we can gain 4 orders of magnitude while getting exactly the same results.

Scoring

Assessing the addition of one edge to this model? We only need to consider 4 cliques.

Solution: 42 bits difference.

Log-linear analysis

Learning graphical models from data

1. Learn structure from data

2. Use the structure

Belief Propagation

Alice has difficulty walking. She had a heart attack 10 years ago. She takes medication for her high blood pressure. The doctor suspects Type 2 diabetes.

p(Diabetes | ...) = 0.01

[Alice's record: Female, 60+, 60-65kg, 170+, and a series of yes/no attributes, some of them unknown]

Diabetes is not likely, so the doctor decides to change her medication for high blood pressure.

Studying correlations & independencies

In data: the systolic blood pressure is independent of the weight, given gender, diastolic blood pressure and medication for high blood pressure.

A change in weight alone is therefore not likely to change the systolic blood pressure.

Information theory

Limitation: the process is quadratic in the number of variables.

1,000+ variables => several days of computation
With prioritization, 1,000+ variables => 1 minute of computation

Scoring decomposable models... only relies upon local structures.

[Scoring the whole model from the data: 32.5. In our case, we only need to score the addition of edge {a,b}, from the data over {a,b} and its neighbourhood: 12.2]

Scoring the addition of an edge {a,b} to a model uses the minimal separator of (a,b): the minimal set of vertices that would disconnect a from b if removed from the graph. In the example, this separator is {c,d}.
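A small sketch of these two ingredients on a toy graph and random binary data (the graph, the variable names and the data are illustrative; Chordalysis obtains the separator from its clique graph rather than from a generic cut algorithm):

```python
import networkx as nx
import numpy as np

# Toy chordal model: a and b are not yet connected; {c, d} separates them.
G = nx.Graph([("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")])
sep = nx.minimum_node_cut(G, "a", "b")           # a minimum (a, b) vertex separator
print("minimal separator of (a, b):", sep)       # -> {'c', 'd'}

# Scoring the candidate edge {a,b}: for log-likelihood-based scores the gain is
# N * I(a ; b | S), the conditional mutual information given the separator, which
# only needs the marginal table over {a, b} U S (illustrative random data here).
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(1000, 4))        # columns: a, b, c, d (binary)
cols = {"a": 0, "b": 1, "c": 2, "d": 3}

def entropy(variables):
    """Empirical joint entropy (in bits) of the given variables."""
    idx = [cols[v] for v in variables]
    _, counts = np.unique(data[:, idx], axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

S = sorted(sep)
gain = len(data) * (entropy(["a"] + S) + entropy(["b"] + S)
                    - entropy(S) - entropy(["a", "b"] + S))
print(f"score improvement for adding (a, b): {gain:.1f} bits")
```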

This has been proven for different scorings:

Statistical tests [1]
MML/MDL [2]
KL divergence / log-likelihood [3]

[1]: F. Petitjean et al., "Scaling log-linear analysis to high-dimensional data," in ICDM 2013.
[2]: F. Petitjean et al., "A statistically efficient and scalable method for log-linear analysis of high-dimensional data," in ICDM 2014.
[3]: A. Deshpande et al., "Efficient stepwise selection in decomposable models," in UAI 2001.

This means that what we are about to show stands for all these scorings.

What we have seen so far:

Evaluating the addition of an edge only depends upon 4 cliques of the graph

How often does that happen?

Our intuition

Not all edges should be re-examined at every step of the process

Select edge {g,h}

Score({a,b}) did not change, so the addition of edge {a,b} need not be re-examined in the new model.

How can we use this information?

We know: if the minimal separator of {a,b} does not change between different modifications of the graph, then the addition of {a,b} need not be re-examined.

1. Use a data structure that gives direct access to minimal separators for every potential edge

2. Keep track of the minimal separators for every potential edge

3. Maintain an ordered list of all the potential edges (priority queue)
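A minimal sketch of that bookkeeping, assuming hypothetical names (`separators`, `scores`, `rescore`); the real implementation maintains these from the clique graph, plain dicts and heapq stand in here:

```python
import heapq

# Current minimal separator and score of each candidate edge (illustrative values;
# in Chordalysis these come from the clique graph and the data).
separators = {("a", "b"): frozenset({"c", "d"}), ("e", "f"): frozenset({"c"})}
scores = {("a", "b"): -42.0, ("e", "f"): -12.2}

def rescore(edge):
    # Hypothetical re-scoring routine: recompute the score of `edge` from the
    # marginals over the edge and its (new) minimal separator.
    return scores[edge]

# Priority queue of candidate edges (lower value = better), with lazy invalidation:
# an entry is stale if its separator has changed since it was pushed.
heap = [(scores[e], e, separators[e]) for e in scores]
heapq.heapify(heap)

def pop_best():
    """Return the best candidate edge whose recorded minimal separator is still current."""
    while heap:
        score, edge, sep_at_push = heapq.heappop(heap)
        if separators.get(edge) == sep_at_push:     # separator unchanged: score still valid
            return edge, score
        if edge in separators:                      # separator changed: re-score and re-push
            heapq.heappush(heap, (rescore(edge), edge, separators[edge]))
    return None
```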

Clique graph

BUT those algorithms cannot track the minimal separators. We make this possible: details in the paper.

[1]: A. Deshpande et al., "Efficient stepwise selection in decomposable models," in UAI 2001.

Keeping track of the minimal separators

We store an enriched association matrix

We store, for each candidate pair (e.g. {c,e}): whether it is eligible for addition, and the associated clique-graph edge.

We modify the original algorithm to maintain this matrix

Priority queue

Example trace:

Add b-e
Add a-b: update a-e
Add e-f: update b-f, disable a-f
Add b-c: update a-c, update c-e, disable c-f
Add a-c
Add c-e: update a-e, enable c-f

Running times

[Runtime vs #variables, for datasets with 20 to 2,000 variables; NYT dataset - 180k instances]

Intuition: when we add edge {f,g}, there are algorithms that can directly update the clique graph [1].

Log-linear analysis = one of THE standard methods in statistics.

**Scaling log-linear analysis to datasets with 1,000+ variables**

**Thanks for your attention!**

Log-linear analysis

General framework: selecting a statistically significant log-linear model = a superset of Markov Networks.

Why doesn't it scale up? Is the better fit of a more complex model worth the increase in complexity? Evaluating this trade-off between complexity and fit is exponential with the number of variables: for 1,000 binary variables, of the order of $10^{300}$ operations, more than the number of atoms in the observable universe...

We want to do the maths!

Known from data

How to scale up this evaluation while keeping its statistical soundness?

Evaluating a general log-linear model requires fitting it first: iterative optimisation.
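To illustrate what that iterative optimisation looks like, here is a small iterative proportional fitting (IPF) sketch for a model with no closed form: the no-three-way-interaction model over three binary variables (the counts are made up for the example).

```python
import numpy as np

# Illustrative 2x2x2 table over three binary variables A, B, C (made-up counts).
observed = np.array([[[10., 20.], [30., 5.]],
                     [[8., 12.], [25., 40.]]])

# Log-linear model with all two-way interactions but no three-way term:
# its sufficient statistics are the three two-way margins, and it has no closed-form MLE.
margins = [(0, 1), (0, 2), (1, 2)]

fitted = np.full_like(observed, observed.sum() / observed.size)
for _ in range(100):                                    # IPF sweeps
    for keep in margins:
        drop = tuple(ax for ax in range(3) if ax not in keep)
        ratio = observed.sum(axis=drop) / fitted.sum(axis=drop)
        fitted *= np.expand_dims(ratio, axis=drop)      # match this margin

# Deviance (G^2) of the fitted model against the data
mask = observed > 0
g2 = 2.0 * np.sum(observed[mask] * np.log(observed[mask] / fitted[mask]))
print(f"G^2 after IPF: {g2:.2f}")
```

Each sweep rescales the fitted table to match one observed margin; decomposable models avoid this loop entirely because their MLE has a closed form.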

Models with a closed form

[Three candidate models over A, B, C]

Closed form iff the graph is chordal (triangulated).
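A quick way to check this property on a graph, using networkx (the example graph is ours, not the talk's):

```python
import networkx as nx

# A 4-cycle A-B-C-D is not chordal; adding the chord B-D triangulates it.
cycle = nx.cycle_graph(["A", "B", "C", "D"])
print(nx.is_chordal(cycle))                 # False: no closed-form MLE

chordal = cycle.copy()
chordal.add_edge("B", "D")
print(nx.is_chordal(chordal))               # True: decomposable model

# For a chordal graph, the maximum-likelihood estimate factorises in closed form
# over its maximal cliques (divided by the separators of a clique tree).
print(sorted(map(sorted, nx.chordal_graph_cliques(chordal))))
# [['A', 'B', 'D'], ['B', 'C', 'D']]
```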

Why are decomposable models so interesting?

Properties:

1. Closed form
2. Not a big restriction: there is always a decomposable model that subsumes a non-decomposable model
3. The MLE always exists [Agresti, 2002]
4. Intersection between BN and MRF [Koller, 2009]

Chordalysis

1,000 binary variables: of the order of $10^{300}$ operations, more than the number of atoms in the observable universe...

Chordalysis: log-linear analysis for decomposable models.

Result #1: proved.
Result #2: comparing models that differ by one edge.

Chordalysis: what is missing to scale up to 1,000+ variables?

In our ICDM 2013 paper:

1. Elicitation of the edges that will keep the graph chordal
2. Efficient computation of maximal cliques and minimal separators
3. Multiple testing correction
4. Efficient computation of marginal frequencies
5. Efficient computation and retrieval of marginal entropies

[Example graph over A, B, C, D]

**100 to 1,000**

Controlled environment

We need a controlled environment... because in reality, we never have the "ground truth": is what Chordalysis finds good? Bad? ... hard to tell for sure...

Experimental methodology

1. Choose a structure
2. Parametrize the model
3. Sample data
4. Learn a structure (with Chordalysis)
5. Compare the structures
6. Output stats

A toy version of this pipeline is sketched below.
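Here is a toy, self-contained version of steps 1-6. The chain structure, the probabilities and the simple conditional-independence learner are all illustrative stand-ins, not the paper's method:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 5000

# 1-3. Choose a structure (a chain A - B - C, no A-C edge), parametrize it, sample data.
A = (rng.random(n) < 0.5).astype(int)
B = np.where(A == 1, rng.random(n) < 0.8, rng.random(n) < 0.2).astype(int)
C = np.where(B == 1, rng.random(n) < 0.7, rng.random(n) < 0.3).astype(int)
data = {"A": A, "B": B, "C": C}
true_edges = {frozenset("AB"), frozenset("BC")}

def g2_dependent(x, y, given=None, alpha=0.05):
    """G^2 test of dependence between x and y, optionally within each value of `given`."""
    g2, df = 0.0, 0
    for g in ([0, 1] if given is not None else [None]):
        xs = x if g is None else x[given == g]
        ys = y if g is None else y[given == g]
        table = np.zeros((2, 2))
        np.add.at(table, (xs, ys), 1)
        expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
        mask = table > 0
        g2 += 2 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))
        df += 1
    return chi2.sf(g2, df) < alpha

# 4. Learn a structure: keep an edge if the pair stays dependent given the third variable.
names = ["A", "B", "C"]
learned = set()
for i, u in enumerate(names):
    for v in names[i + 1:]:
        w = next(x for x in names if x not in (u, v))
        if g2_dependent(data[u], data[v], given=data[w]):
            learned.add(frozenset((u, v)))

# 5-6. Compare structures and output stats.
tp = len(learned & true_edges)
precision = tp / max(len(learned), 1)
recall = tp / len(true_edges)
print(f"learned: {sorted(map(sorted, learned))}, precision={precision:.2f}, recall={recall:.2f}")
```

Precision and recall here compare the learned edge set against the generating structure.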

Results

[Plots: precision, statistical efficiency (#mistakes), and runtime]

Real-world examples

Study of the elderly: 25 variables, 15,000 patients. Computation: 2s.
Customer management: 85 variables, 6,000 clients. Computation: 2s.

2007

**10 to 100**

... ICDM-13, ICDM-14, SDM-15, KDD-16