**Interpretability?**

Step 1:

Find Rules

Step 2:

Order Rules

Association Rules

- As long as Step 1 is sufficient, rely on Step 2 for modeling

- Usually a heuristic ordering of rules (works well in the non-noisy case), or a linear combination of rules, which is not interpretable

Decision Lists

- Heuristic ordering of rules

- Mostly theoretical

Machine Learning for Interpretable Predictive Modeling

Cynthia Rudin

Associate Professor

CSAIL, Operations Research Center, Sloan School of Management

Massachusetts Institute of Technology

Collaborators: Ben Letham, Berk Ustun, Stefano Traca, Tyler McCormick, David Madigan

Want to produce predictive models that are:

Accurate and Interpretable

Concise and Convincing

Interpretability vs. Accuracy - a sacrifice?

Main Hypothesis:

Many real datasets permit powerful predictive models that can be surprisingly small.

Reliability Modeling

Marketing

click rate

churn prediction

Personalized Medicine

Science

modeling

understanding

Example: Predict whether I'll get stuck in traffic

IF game at fenway park THEN traffic (97/100)

ELSE IF not rush hour THEN no traffic (474/523)

ELSE IF no rain & no construction THEN no traffic (329/482)

ELSE IF Friday THEN no traffic (3/3)

ELSE IF rain THEN traffic (452/892)

OTHERWISE no traffic (10/15)

(same form as expert systems)
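A decision list like the one above is applied by checking rules top-down and firing the first one that matches. A minimal sketch (feature names are illustrative; the counts in the comments are the slide's empirical fractions):

```python
# Minimal sketch of evaluating the traffic decision list above.
# Feature names are illustrative; counts in comments are from the slide.

def predict_traffic(x):
    """x: dict of boolean features. Rules are checked in order;
    the first one that matches fires."""
    if x["fenway_game"]:
        return "traffic"        # 97/100
    elif not x["rush_hour"]:
        return "no traffic"     # 474/523
    elif not x["rain"] and not x["construction"]:
        return "no traffic"     # 329/482
    elif x["friday"]:
        return "no traffic"     # 3/3
    elif x["rain"]:
        return "traffic"        # 452/892
    else:
        return "no traffic"     # 10/15 (default rule)

# A rainy rush hour with no Fenway game, no construction, not Friday:
print(predict_traffic({"fenway_game": False, "rush_hour": True, "rain": True,
                       "construction": False, "friday": False}))  # traffic
```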

[Decision tree diagram for the traffic example: splits on rain, rush hour, construction, Fenway, and Friday, with traffic / no traffic leaves. Trees: grow and prune.]

Machine Learning for the New York City Power Grid.

IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, No 2. February 2012.

C. Rudin, D. Waltz, R. Anderson, A. Boulanger, A. Salleb-Aouissi, M. Chow, H. Dutta, P. Gross, B. Huang, S. Ierome, D. Isaac, A. Kressner, R. Passonneau, A. Radeva, L. Wu

A Process for Predicting Manhole Events in Manhattan.

Machine Learning, Vol. 80, pages 1-31, 2010.

C. Rudin, R. Passonneau, A. Radeva, H. Dutta, S. Ierome, D. Isaac

C4.5 Quinlan 1993

CART Breiman 1984

Liu et al 1998, Li et al 2001, Yin and Han 2003, Simon et al 2011.

Review articles: Thabtah 2007, Ruckert 2008, Vanhoof and Depaire 2010

e.g., Friedman and Popescu 2008, Meinshausen 2010

Rivest 1987, Klivans and Servedio 2006, Anthony 2005,

Long and Servedio 2007, Marchand and Sokolova 2005


Bayesian Hierarchical Association Rule Modeling for Predicting Medical Conditions

T. McCormick, C. Rudin, D. Madigan

Annals of Applied Statistics, Vol. 6, No 2, pp. 652-668, 2012

Medical Condition Prediction

Predictions based on your medical history:

Logical Analysis of Data

- Works only in non-noisy case

For trials = 1:500
    Form training and test sets:
        sample ~200 patients
        for each patient, randomly split encounters into training and test
    For each patient, iteratively make predictions on test encounters:
        get 1 point whenever our top 3 recommendations contain the patient's next condition
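The scoring step of this experiment can be sketched as follows; `predict_top3` is a hypothetical stand-in for whichever model (HARM or a baseline) is being evaluated:

```python
# Illustrative sketch of the scoring loop: 1 point whenever the true next
# condition appears in the model's top-3 recommendations.
# `predict_top3` is a hypothetical stand-in for the model being evaluated.

def top3_score(predict_top3, test_encounters):
    points = 0
    history = []
    for true_next in test_encounters:
        recs = predict_top3(history)   # model's top-3 predicted conditions
        if true_next in recs:
            points += 1
        history.append(true_next)      # reveal the truth, move to next encounter
    return points

# Toy model that always recommends the same three conditions:
def always_same(history):
    return ["hypertension", "diabetes", "back pain"]

print(top3_score(always_same, ["diabetes", "gout", "hypertension"]))  # 2
```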

- 43K patient encounters

- 2.3K patients, age > 40

- pre-existing conditions taken into account

Experiment

HARM


Optimized Decision Lists

~43K doctor's visits

~2300 patients, age>40

- preexisting conditions handled

- rank rules by posterior mean

- iteratively predict the next medical condition

- 1 point if any of the top 3 predicted conditions are right

[Results plot: HARM vs. adjusted confidence vs. the traditional algorithm vs. baseline.]

Step 1:

Find Rules

Step 2:

Order Rules

MIP approach to both rule mining and rule ordering

High accuracy, completely robust to outliers

Can control tradeoff between accuracy and interpretability

Computationally intensive

Has optimality guarantee

Optimized Decision Lists (ODL)

Step 2: Optimize ordering of rules to maximize classification accuracy, regularized by size of the list.

[Rule list notation: rules such as "Thursday=1" and the default rule "otherwise".]

Maximize classification accuracy

Maximize rank of highest default rule

(sparsity regularization)
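ODL solves the ordering problem with mixed-integer programming. As a toy illustration of the objective only (not the paper's MIP formulation), here is a brute-force search over orderings, feasible only for tiny rule sets:

```python
# Toy illustration of the Step 2 objective: pick the ordering of rules that
# maximizes training accuracy. The real method solves this with a MIP; this
# brute force over permutations is only feasible for tiny rule sets.
from itertools import permutations

def list_accuracy(order, rules, default, data):
    """Accuracy of the decision list given by `order` (tuple of rule indices).
    rules: list of (condition_fn, label); unmatched points get `default`."""
    correct = 0
    for x, y in data:
        pred = default
        for i in order:
            cond, label = rules[i]
            if cond(x):
                pred = label
                break
        correct += (pred == y)
    return correct / len(data)

def best_ordering(rules, default, data):
    return max(permutations(range(len(rules))),
               key=lambda order: list_accuracy(order, rules, default, data))

# Toy data: y = 1 iff x is even and x < 4
data = [(x, int(x % 2 == 0 and x < 4)) for x in range(6)]
rules = [(lambda x: x % 2 == 0, 1),   # "even -> 1": wrong on x=4
         (lambda x: x >= 4, 0)]       # "x>=4 -> 0": fixes x=4 if checked first
print(best_ordering(rules, 0, data))  # (1, 0)
```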

Datasets

Accuracy Comparison

Algorithms

Tic Tac Toe

Goal: Given a finished tic tac toe board, determine whether x has won the game

CART on Tic Tac Toe: accuracy = .8873 +/- .0061

C4.5 on Tic Tac Toe: accuracy = .9259 +/- .0066

Optimized Decision Lists on Tic Tac Toe: accuracy = 1 +/- 0

Crime

Goal: predict whether a young person will commit a violent crime

*dataset from US Dept of Justice Statistics

more accurate than C4.5, CART, Ada, RF, SVM, etc.

This shows (to us at least) that there is no accuracy/interpretability tradeoff

accuracy on par with others

sparse solutions

has optimality guarantee

robust to outliers

the only possible disadvantage is that it is computationally intensive


Sequential Event Prediction with Association Rules

C. Rudin, B. Letham, A. Salleb-Aouissi, E. Kogan and D. Madigan.

Proceedings of the 24th Conference on Computational Learning Theory, 2011

Adjusted Confidence

Can understand the whole algorithm

Very tractable

Reasonable choice for "cold start" problems

Step 1:

find rules

Step 2a:

compute adjusted

confidence of each rule

lhs → rhs

Step 2b:

rank rules in decreasing

order of adjusted conf.

shrinks probability towards 0

encourages rules with higher support

reduces to traditional algorithm when K=0

K acts like a regularization parameter.

Larger K produces smaller lists. Larger K has a better generalization guarantee.

Smaller K can fit the data better, allows rare but powerful rules

This method is useful for "cold start" problems, but we can use more information to do better.
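A minimal sketch of the adjusted confidence itself, assuming the rule statistics (#lhs and #(lhs and rhs)) have already been mined in Step 1:

```python
# Sketch of the adjusted confidence of a rule lhs -> rhs:
#     #(lhs and rhs) / (#lhs + K)
# assuming the counts were mined in Step 1. K=0 recovers ordinary confidence.

def adjusted_confidence(n_both, n_lhs, K):
    """n_both = #(lhs and rhs), n_lhs = #lhs."""
    return n_both / (n_lhs + K)

# A rare-but-perfect rule vs. a common rule, at K=0 and K=5:
rare = (2, 2)        # fires twice, right both times
common = (80, 100)   # fires 100 times, right 80 times
for K in (0, 5):
    print(K, round(adjusted_confidence(*rare, K), 3),
             round(adjusted_confidence(*common, K), 3))
# K=0 ranks the rare rule first (1.0 > 0.8); K=5 shrinks it below
# the well-supported rule (0.286 < 0.762).
```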


Building Interpretable Classifiers with Rules and Bayesian Analysis

B. Letham, C. Rudin, T. McCormick, D. Madigan

Bayesian List Machine

Produces a posterior of rule lists.

The prior provides interpretability. Favors small rule lists.

150 Metropolis-Hastings steps per second on Ben's laptop


Learning Rule Lists with Simulated Annealing

RulesSA

RulesSA

same goal (out-of-the-box classification)

uses simulated annealing rather than MIO

minimizes:

can be very tractable, but speed depends on the details of the various probability distributions within the algorithm, and on warm starts

Algorithm:

    Start with a rule list, and the set of other possible rules
    for t = 1..T:
        propose to insert or remove a rule from the rule list, or swap the default rule
        calculate the objective
        accept the proposed move according to the SA schedule
    end for
    restart the computation every once in a while
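The loop above can be sketched as follows. This is an illustrative skeleton only; the proposal moves and annealing schedule are simplified stand-ins, not the authors' exact distributions:

```python
# Illustrative skeleton of the simulated-annealing search over rule lists;
# the proposal moves and schedule here are simplified stand-ins, not the
# authors' exact distributions.
import math
import random

def anneal_rule_list(rule_pool, score, T=1.0, cooling=0.995, steps=2000, seed=0):
    """rule_pool: iterable of rule ids; score: rule list (tuple) -> objective
    (higher is better). Returns the best rule list visited."""
    rng = random.Random(seed)
    pool = list(rule_pool)
    current, best = (), ()
    for _ in range(steps):
        cand = list(current)
        if cand and rng.random() < 0.5:
            cand.pop(rng.randrange(len(cand)))              # remove a rule
        else:
            unused = [r for r in pool if r not in cand]
            if unused:                                      # insert a rule
                cand.insert(rng.randrange(len(cand) + 1), rng.choice(unused))
        cand = tuple(cand)
        delta = score(cand) - score(current)
        if delta >= 0 or rng.random() < math.exp(delta / T):
            current = cand                                  # accept per schedule
        if score(current) > score(best):
            best = current
        T *= cooling                                        # cool down
    return best

# Toy objective: reward containing rule 2, penalize list length (sparsity):
print(anneal_rule_list(range(5), lambda L: (2 in L) - 0.1 * len(L)))
```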

Goal: Beat decision trees.

Create predictive models that are highly accurate yet highly interpretable.

Hypothesis that many datasets permit predictive models that are surprisingly small.

(also see Holte 1993)

Presented several models, each with its own appeal:

Optimized Decision Lists, Bayesian List Machine, HARM

applied to critical domains of criminology and personalized medicine

I would like interpretability to be considered a fundamental desirable quality in a model. Gaining insight is often an important means toward gaining predictive accuracy.

[Rule list notation: a "nugget" is a rule such as Thursday=1; the "default rule" is "otherwise".]

Example: Predict whether someone's income is >$50K

IF capital_gain> $7298 THEN >50K

ELSE IF married & capital_loss between $1887 and $1977 THEN >50K

ELSE IF education>13 years & married THEN >50K

OTHERWISE not >$50K

[Results plot: adjusted confidence with K=0 and K=5, compared against others (Logistic Regression, Random Forests).]

all rules apply separately to the patient, not a decision list

need to estimate conditional probabilities for each rule

use similar patients to "borrow strength"

high blood pressure → stroke

IF spiculated margin THEN malignant (94/111)

ELSE IF age>57 and ill-defined margin THEN malignant (104/121)

ELSE IF irregular shape THEN malignant (94/126)

OTHERWISE benign (344/411)

Mammographic Mass dataset

Algorithm:  BDL     SVM     C4.5    CART    logistic regression
Accuracy:   0.7604  0.7292  0.7448  0.7135  0.7396

*BDL also produces a perfect classifier for tic tac toe

Outline

Optimized Decision Lists

Uses Mixed Integer Optimization

Bayesian List Machine

Bayesian Analysis, produces a full posterior of rule lists

Hierarchical Association Rule Model (HARM)

Estimates rare rule probabilities by borrowing strength

Overall point: We can automate ways to gain insight into scientific data through predictive modeling. There are many methods to do this, so let's explore.

Goal: Measure stroke risk among patients with atrial fibrillation (Beat the CHADS2 Score)

CHADS2 was originally developed using data from 1,733 Medicare beneficiaries.

Considers: congestive heart failure, hypertension, age, diabetes, and prior stroke.

The number of people who had the maximum score was ~5.

We started with 11.1 million Medicaid enrollees, took those with a diagnosis of atrial fibrillation, one-year of atrial fibrillation-free observation time prior to the diagnosis, and one year of observation time following the diagnosis (~12.5K).

We considered all drugs and conditions.

Our current "CHADS2"

dyspnoea - shortness of breath

levofloxacin - antibiotic

cardiomegaly - enlarged heart

lisinopril - high blood pressure med

hydrochlorothiazide - high blood pressure and fluid retention med

cephalexin - antibiotic

*haven't been able to get C4.5 or CART to run on these data.

ranks rules

According to learning theory, interpretability can help with prediction accuracy; they are not necessarily at odds.

Imbalanced data: when there is more of one class than another. One class becomes more important.

Constant determines sensitivity/specificity tradeoff

IF Service Box AND Was the trouble hole for a non-serious event sometime in the past

ELSE IF Service Box AND Level 1 repair was made during most recent inspection AND the repair was at least 3 years ago with no involvement in events after the inspection

ELSE IF Most recent inspection was clean with no involvement in events after the inspection AND Number of main phase cables < 30

OTHERWISE

Imbalanced Data

Vulnerability for Training    Vulnerability for Testing
364/4437 (8.2%)               298/4516 (6.6%)
2227/2321 (4.1%)              2436/2531 (3.8%)
2058/2155 (4.5%)              2627/2729 (3.7%)
438/7608 (5.8%)               337/6638 (5.1%)

Manhole Event Data

Goal: Predict serious manhole events


Goal is to prevent electrical grid failures

Massive amount of diverse, historical data, need to make sense of it

Prediction relies critically on data insight

This is an essential domain in which ML can make a difference.

The Grid needs to be maintained

Oct 19, 2005

Brooklyn NY

21st-Century Data Miners Meet 19th-Century Electrical Cables.

IEEE Computer, volume 44 no. 6, pages 103-105, June 2011.

(One of three articles featured on the cover.)

C. Rudin, R. Passonneau, A. Radeva, S. Ierome, D. Isaac

Columbia/Con Edison collaboration to use data mining/ML for smart power grid maintenance

Summary

IF hemiplegia THEN stroke risk 58.0%

ELSE IF prior stroke THEN stroke risk 46.6%

ELSE IF prior TIA and hypertension THEN stroke risk 23.2%

ELSE IF vascular disease THEN stroke risk 16.4%

ELSE IF age<=60 THEN stroke risk 3.7%

OTHERWISE stroke risk 8.5%

Criminology/Policing

*Winner of 2013 INFORMS Innovative Applications in Analytics Award

- Winner of Data Mining Best Student Paper Competition, INFORMS 2013

- Winner of Student Paper Competition, Statistical Learning and Data Mining Section, American Statistical Association, 2014

Bayesian List Machine - Short Summary


Supersparse Linear Integer Models for Interpretable Classification

B. Ustun, S. Traca, C. Rudin

Scoring Systems

SLIM

- Sparse linear models

- Coefficients in a predefined discrete set for interpretability

- Maximize both accuracy and interpretability directly

Use the L1 norm only to choose among equally sparse solutions. It forces at least two of the coefficients to be coprime: otherwise, dividing all coefficients by their common factor would give the same predictions with a smaller L1 norm.

Solvable for N ~ 10000 and P ~ 100 in minutes

SLIM Formulation

With Useful Coefficient Sets

bounded integers

sign-constrained integers

one significant digit

two significant digits
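A SLIM-style model is simply a scoring system: a few features with small integer points and a threshold. A hypothetical example (the features, points, and threshold here are made up for illustration, not a fitted model):

```python
# Hypothetical SLIM-style scoring system: a sparse linear model whose
# coefficients are small integers, applied by adding up points and
# comparing to a threshold. Features, points, and threshold are made up
# for illustration, not a fitted model.

POINTS = {"prior_record": 2, "age<=21": 1, "school_dropout": 3}
THRESHOLD = 3

def score(features):
    """Sum the integer points of the features that are present."""
    return sum(POINTS[f] for f in features if f in POINTS)

def predict(features):
    """Predict positive when the score reaches the threshold."""
    return score(features) >= THRESHOLD

print(score({"age<=21", "school_dropout"}))  # 4
print(predict({"age<=21"}))                  # False
```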

Violent Crime Scoring System

Some Comparative Results

SLIM's solution dominates all of LARS's solutions, in both accuracy and interpretability, even with integer coefficients

- SLIM's results are competitive even with the restriction to integer solutions.

- It's because we use a better loss function

Spambase


ODL

"This simplicity gets at the important issue: A decent transparent model that is actually used will outperform a sophisticated system that predicts better but sits on a shelf. If the researchers had created a model that predicted well but was more complicated, the LAPD likely would have ignored it, thus defeating the whole purpose."

--- Greg Ridgeway, Director of National Institute of Justice

LAPD Recruits Model