Cynthia Rudin

on 25 July 2015


Transcript of InterpretabilityPrezi4

Step 1:
Find Rules
Step 2:
Order Rules
Association Rules
- Modeling relies on Step 1 (rule mining); Step 2 is usually a heuristic ordering of rules (works well in the non-noisy case)
- or a linear combination of rules, which is not interpretable
Decision Lists
- Heuristic ordering of rules
- Mostly theoretical
Machine Learning for
Interpretable Predictive Modeling
Cynthia Rudin

Associate Professor
CSAIL, Operations Research Center, Sloan School of Management
Massachusetts Institute of Technology
Collaborators: Ben Letham, Berk Ustun, Stefano Traca, Tyler McCormick, David Madigan

Want to produce predictive models that are:
Accurate and Interpretable
Concise and Convincing
Interpretability vs. Accuracy - a sacrifice?
Main Hypothesis:
Many real datasets permit powerful predictive models that can be surprisingly small.
Reliability Modeling
click rate
churn prediction
Personalized Medicine
Example: Predict whether I'll get stuck in traffic
IF game at fenway park THEN traffic (97/100)
ELSE IF not rush hour THEN no traffic (474/523)
ELSE IF no rain & no construction THEN no traffic (329/482)
ELSE IF Friday THEN no traffic (3/3)
ELSE IF rain THEN traffic (452/892)
OTHERWISE no traffic (10/15)
(same form as expert systems)
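A decision list like the one above is evaluated top-down: the first rule whose condition matches fires, and the last line is the default rule. A minimal sketch in Python, with hypothetical boolean feature names taken loosely from the slide:

```python
# Sketch of evaluating the traffic decision list above.
# Feature names are illustrative, not from any real dataset.

def predict_traffic(x):
    """x is a dict of boolean features; returns 'traffic' or 'no traffic'."""
    if x["game_at_fenway"]:
        return "traffic"          # 97/100 on the slide
    elif not x["rush_hour"]:
        return "no traffic"       # 474/523
    elif not x["rain"] and not x["construction"]:
        return "no traffic"       # 329/482
    elif x["friday"]:
        return "no traffic"       # 3/3
    elif x["rain"]:
        return "traffic"          # 452/892
    else:
        return "no traffic"       # default rule, 10/15

example = {"game_at_fenway": False, "rush_hour": True,
           "rain": True, "construction": True, "friday": False}
print(predict_traffic(example))   # -> traffic
```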
[Decision tree diagram: splits such as "rush hour?" leading to traffic / no traffic leaves; trees are grown and pruned]
Machine Learning for the New York City Power Grid.
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, No 2. February 2012.
C. Rudin, D. Waltz, R. Anderson, A. Boulanger, A. Salleb-Aouissi, M. Chow, H. Dutta, P. Gross, B. Huang, S. Ierome, D. Isaac, A. Kressner, R. Passonneau, A. Radeva, L. Wu
A Process for Predicting Manhole Events in Manhattan.
Machine Learning, Vol. 80, pages 1-31, 2010.
C. Rudin, R. Passonneau, A. Radeva, H. Dutta, S. Ierome, D. Isaac
C4.5 Quinlan 1993
CART Breiman 1984
Liu et al 1998, Li et al 2001, Yin and Han 2003, Simon et al 2011.
Review articles: Thabtah 2007, Ruckert 2008, Vanhoof and Depaire 2010
e.g., Friedman and Popescu 2008, Meinshausen 2010
Rivest 1987, Klivans and Servedio 2006, Anthony 2005,
Long and Servedio 2007, Marchand and Sokolova 2005
Bayesian Hierarchical Association Rule Modeling for Predicting Medical Conditions
T. McCormick, C. Rudin, D. Madigan
Annals of Applied Statistics, Vol. 6, No 2, pp. 652-668, 2012
Medical Condition Prediction
Predictions based on your medical history:
Logical Analysis of Data
- Works only in non-noisy case
For trials = 1:500
    Form training and test sets:
        sample ~200 patients
        for each patient, randomly split encounters into training and test
    For each patient, iteratively make predictions on test encounters
        get 1 point whenever our top 3 recommendations contain the patient's next condition
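A minimal sketch of this evaluation protocol in Python. The `top3(history)` model interface is a hypothetical stand-in for whichever rule-based predictor is being evaluated, and the 50/50 encounter split stands in for the random split in the real protocol:

```python
import random

# Hedged sketch of the top-3 evaluation loop described above.
# `model` is any object with a method top3(history) returning the three
# most likely next conditions given the encounters seen so far.

def evaluate(patients, model, trials=500, sample_size=200, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        sample = rng.sample(patients, min(sample_size, len(patients)))
        points, total = 0, 0
        for encounters in sample:
            split = len(encounters) // 2           # fixed split; random in the real protocol
            history, test = encounters[:split], encounters[split:]
            for nxt in test:                        # iteratively predict the next condition
                if nxt in model.top3(history):
                    points += 1                     # 1 point when the top 3 contain it
                total += 1
                history = history + [nxt]           # reveal it before the next prediction
        scores.append(points / max(total, 1))
    return sum(scores) / len(scores)
```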
- 43K patient encounters
- 2.3K patients, age > 40
- pre-existing conditions taken into account
Optimized Decision Lists
~43K doctor's visits
~2300 patients, age>40
- preexisting conditions handled
- rank rules by posterior mean
- iteratively predict the next
medical condition
-1 point if any of the top 3 predicted
conditions are right
Step 1:
Find Rules
Step 2:
Order Rules
MIP approach to both rule mining and rule ordering
High accuracy, completely robust to outliers
Can control tradeoff between accuracy and interpretability
Computationally intensive
Has optimality guarantee
Optimized Decision Lists (ODL)
Step 2: Optimize ordering of rules to maximize classification accuracy, regularized by size of the list.
Maximize classification accuracy
Maximize rank of highest default rule
(sparsity regularization)
Accuracy Comparison
Tic Tac Toe
Goal: Given a finished tic tac toe board, determine whether x has won the game
CART on Tic Tac Toe: accuracy = .8873 +/- .0061
C4.5 on Tic Tac Toe: accuracy = .9259 +/- .0066
Optimized Decision Lists on Tic Tac Toe: accuracy = 1 +/- 0
Goal: predict whether a young person will commit a violent crime
*dataset from US Dept of Justice Statistics
more accurate than C4.5, CART, Ada, RF, SVM, etc.
This shows (to us at least) that there is no accuracy-interpretability tradeoff
accuracy on par with others
sparse solutions
has optimality guarantee
robust to outliers
the only possible disadvantage is that it is computationally intensive
Sequential Event Prediction with Association Rules
C. Rudin, B. Letham, A. Salleb-Aouissi, E. Kogan and D. Madigan.
Proceedings of the 24th Conference on Computational Learning Theory, 2011
Adjusted Confidence
Can understand the whole algorithm
Very tractable
Reasonable choice for "cold start" problems
Step 1: find rules
Step 2a: compute the adjusted confidence of each rule (lhs → rhs)
Step 2b: rank rules in decreasing order of adjusted confidence
shrinks probability towards 0
encourages rules with higher support
reduces to traditional algorithm when K=0
K acts like a regularization parameter.
Larger K produces smaller lists. Larger K has a better generalization guarantee.
Smaller K can fit the data better, allows rare but powerful rules
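A small sketch of the adjusted confidence described above, assuming the usual definition of confidence Supp(lhs ∧ rhs)/Supp(lhs) with K added to the denominator; K = 0 recovers ordinary confidence, and larger K shrinks low-support rules toward 0:

```python
# Sketch of adjusted confidence for a rule lhs -> rhs over a list of
# transactions (each a set of items). K penalizes rules with low support.

def adjusted_confidence(transactions, lhs, rhs, K):
    lhs, rhs = set(lhs), set(rhs)
    supp_lhs = sum(1 for t in transactions if lhs <= set(t))
    supp_both = sum(1 for t in transactions if (lhs | rhs) <= set(t))
    return supp_both / (supp_lhs + K)

data = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
print(adjusted_confidence(data, {"a"}, {"b"}, K=0))  # 2/3, the usual confidence
print(adjusted_confidence(data, {"a"}, {"b"}, K=1))  # 2/4 = 0.5, shrunk toward 0
```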
This method is useful for "cold start" problems, but we can use more information to do better.
Building Interpretable Classifiers with Rules and Bayesian Analysis
B. Letham, C. Rudin, T. McCormick, D. Madigan
Bayesian List Machine
Produces a posterior of rule lists.
The prior provides interpretability. Favors small rule lists.
150 Metropolis-Hastings steps per second on Ben's laptop
Learning Rule Lists with Simulated Annealing
same goal (out-of-the-box classification)
uses simulated annealing rather than MIO

can be very tractable, but speed depends on the details of various prob. distributions within the algorithm, and warm start
Start with a rule list, and the set of other possible rules
for t = 1..T
    propose to insert or remove a rule from the rule list, or swap the default rule
    calculate objective
    accept proposed move according to SA schedule
end for
restart the computation every once in a while
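A hedged sketch of this simulated-annealing search: insert/remove proposals on the rule list, scored by a user-supplied objective, with worse moves accepted with probability exp(-delta/T) under a simple cooling schedule. The rule representation, the `score` function, and the schedule are placeholders; the default-rule swap and restarts are omitted for brevity:

```python
import math
import random

# Sketch of simulated annealing over rule lists. `all_rules` is any list
# of hashable rule objects; `score` maps a rule list to a number we maximize.

def anneal_rule_list(all_rules, score, steps=1000, T0=1.0, seed=0):
    rng = random.Random(seed)
    current = []                                     # start with the default rule only
    cur_score = score(current)
    best, best_score = list(current), cur_score
    for t in range(1, steps + 1):
        T = T0 / t                                   # simple cooling schedule
        proposal = list(current)
        unused = [r for r in all_rules if r not in proposal]
        if proposal and (not unused or rng.random() < 0.5):
            proposal.pop(rng.randrange(len(proposal)))            # remove a rule
        else:
            proposal.insert(rng.randrange(len(proposal) + 1),
                            rng.choice(unused))                   # insert a rule
        new_score = score(proposal)
        delta = cur_score - new_score                # positive when the move is worse
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            current, cur_score = proposal, new_score # accept per SA schedule
        if cur_score > best_score:
            best, best_score = list(current), cur_score
    return best, best_score
```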
Goal: Beat decision trees.

Create predictive models that are highly accurate yet highly interpretable.

Hypothesis that many datasets permit predictive models that are surprisingly small.
(also see Holte 1993)

Presented several models, each with their own appeal
Optimized Decision Lists, Bayesian List Machine, HARM
applied to critical domains of criminology and personalized medicine

I would like to see interpretability considered a fundamental desirable quality in a model. Gaining insight is often an important means towards gaining predictive accuracy.
"default rule"
Example: Predict whether someone's income is >$50K
IF capital_gain> $7298 THEN >50K
ELSE IF married & capital_loss between $1887 and $1977 THEN >50K
ELSE IF education>13 years & married THEN >50K
Others: Logistic Regression, Random Forests
all rules apply separately to the patient, not a decision list
need to estimate conditional probabilities for each rule
use similar patients to "borrow strength"
high blood pressure → stroke
spiculated margin → malignant (94.0/111.0)
age>57 and ill-defined margin → malignant (104.0/121.0)
irregular shape → malignant (94.0/126.0)
otherwise → benign (344.0/411.0)
Mammographic Mass dataset
Algorithm:  BDL     SVM     C4.5    CART    Logistic Regression
Accuracy:   0.7604  0.7292  0.7448  0.7135  0.7396
*BDL also produces a perfect classifier for tic tac toe

Optimized Decision Lists
Uses Mixed Integer Optimization

Bayesian List Machine
Bayesian Analysis, produces a full posterior of rule lists

Hierarchical Association Rule Model (HARM)
Estimates rare rule probabilities by borrowing strength

Overall point: We can automate ways to gain insight into scientific data through predictive modeling. There are many methods to do this, so let's explore.
Goal: Measure stroke risk among patients with atrial fibrillation (Beat the CHADS2 Score)
CHADS2 was originally developed using data from 1,733 Medicare beneficiaries.
Considers: congestive heart failure, hypertension, age, diabetes, and prior stroke.
The number of people who had the maximum score was ~5.

We started with 11.1 million Medicaid enrollees, took those with a diagnosis of atrial fibrillation, one-year of atrial fibrillation-free observation time prior to the diagnosis, and one year of observation time following the diagnosis (~12.5K).

We considered all drugs and conditions.
Our current "CHADS2"
dyspnoea - shortness of breath
levofloxacin - antibiotic
cardiomegaly - enlarged heart
lisinopril - high blood pressure med
hydrochlorothiazide - high blood pressure and fluid retention med
cephalexin - antibiotic
*haven't been able to get C4.5 or CART to run on these data.
ranks rules
According to learning theory, interpretability can coexist with prediction accuracy; they are not necessarily at odds.

Imbalanced data: when there is more of one class than another. One class becomes more important.

Constant determines sensitivity/specificity tradeoff
IF Service Box AND Was the trouble hole for a non-serious event sometime in the past

ELSE IF Service Box AND Level 1 repair was made during most recent inspection AND the repair was at least 3 years ago with no involvement in events after the inspection

ELSE IF Most recent inspection was clean with no involvement in events after the inspection AND Number of main phase cables < 30

Imbalanced Data
[Figure: vulnerability comparisons for manhole events — 364/4437 (8.2%) vs. 298/4516 (6.6%); 2227/2321 (4.1%) vs. 2436/2531 (3.8%); 2058/2155 (4.5%) vs. 2627/2729 (3.7%); 438/7608 (5.8%) vs. 337/6638 (5.1%)]
Manhole Event Data
Goal: Predict serious manhole events
Goal is to prevent electrical grid failures
Massive amount of diverse, historical data, need to make sense of it
Prediction relies critically on data insight
This is an essential domain in which ML can make a difference.
The Grid needs to be maintained
Oct 19, 2005
Brooklyn NY
21st-Century Data Miners Meet 19th-Century Electrical Cables.
IEEE Computer, volume 44 no. 6, pages 103-105, June 2011.
(One of three articles featured on the cover.)
C. Rudin, R. Passonneau, A. Radeva, S. Ierome, D. Isaac
Columbia/Con Edison collaboration to use data mining/ML for smart power grid maintenance
IF Hemiplegia THEN stroke risk 58.0%
ELSE IF Prior stroke THEN stroke risk 46.6%
ELSE IF Prior TIA and Hypertension THEN stroke risk 23.2%
ELSE IF Vascular disease THEN stroke risk 16.4%
ELSE IF Age <= 60 THEN stroke risk 3.7%
OTHERWISE stroke risk 8.5%
*Winner of 2013 INFORMS Innovative Applications in Analytics Award
- Winner of Data Mining Best Student Paper Competition, INFORMS 2013
- Winner of Student Paper Competition, Statistical Learning and Data Mining Section, American Statistical Association, 2014
Bayesian List Machine - Short Summary
Supersparse Linear Integer Models for Interpretable Classification
B. Ustun, S. Traca, C. Rudin
Scoring Systems
- Sparse linear models
- Coefficients in a predefined discrete set for interpretability
- Maximize both accuracy and interpretability directly
Only use L1 norm to choose from equally sparse solutions. It forces at least two of the coefficients to be coprime.
Solvable for N ~ 10000 and P ~ 100 in minutes
SLIM Formulation
With Useful Coefficient Sets
bounded integers
sign-constrained integers
one significant digit
two significant digits
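A SLIM-style scoring system reduces prediction to adding up small integer points. A minimal sketch of applying such a model; the features, weights, and threshold below are made up for illustration, not the paper's violent-crime model:

```python
# Sketch of applying a scoring system: a sparse linear model whose
# coefficients live in a small integer set. Weights are illustrative only.

weights = {"prior_offense": 4, "age_under_21": 2, "gang_member": 3}
threshold = 5   # predict positive when the point total exceeds this

def slim_score(x):
    """x maps feature names to 0/1; returns (score, prediction)."""
    score = sum(w * x.get(f, 0) for f, w in weights.items())
    return score, score > threshold

print(slim_score({"prior_offense": 1, "age_under_21": 1}))   # (6, True)
print(slim_score({"gang_member": 1}))                        # (3, False)
```

Because the coefficients are small integers, the entire model fits on an index card, which is the interpretability the slide is after.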
Violent Crime Scoring System
Some Comparative Results
- The SLIM solution dominates all of LARS's solutions in both accuracy and interpretability, even with integer coefficients.
- SLIM's results are competitive even with the restriction to integer solutions.
- This is because SLIM uses a better loss function.
"This simplicity gets at the important issue: A decent transparent model that is actually used will outperform a sophisticated system that predicts better but sits on a shelf. If the researchers had created a model that predicted well but was more complicated, the LAPD likely would have ignored it, thus defeating the whole purpose."

--- Greg Ridgeway, Director of National Institute of Justice
LAPD Recruits Model