**Falling Rule Lists**

Hospital Readmission Prediction

Goal: Predict 30-day readmissions

Data from Mass General Hospital collaborators:

- ~8,000 patients who had no prior history of readmissions

- Features are very detailed, e.g., "impaired mental status," "difficult behavior," "chronic pain," "feels unsafe," etc.

- Labels for each patient, whether readmitted or not

- List size parameter set at 8 rules, no parameter tuning

**Bayesian Rule Lists**

**The Bayesian Case Model**

**Other Models**

"This simplicity gets at the important issue: A decent transparent model that is actually used will outperform a sophisticated system that predicts better but sits on a shelf. If the researchers had created a model that predicted well but was more complicated, the LAPD likely would have ignored it, thus defeating the whole purpose."

--- Greg Ridgeway, Director of the National Institute of Justice

LAPD Recruits Model

Goal: Beat CART (Classification and Regression Trees)

C4.5 Quinlan 1993

CART Breiman 1984

CART is the "go-to" method in industry for interpretable classification.

It works well but is greedy, sacrificing both accuracy and interpretability.

[Decision tree figure: root splits on "rain?"; branches split on "rush hour?", "construction?", and "Friday?", with leaves predicting traffic / no traffic]

I hypothesize that many real problems are bowl-shaped.

[Plot: Classification Error vs. Models, with a flat region at the bottom of the bowl]

If I'm right, there exist many very sparse yet very accurate models. This means we can find these models if we look hard enough.

If I'm right, there might be many different decisions corresponding to the range of reasonable models.

Related to the Rashomon Effect of Breiman

Example: Predict whether I'll get stuck in traffic

IF game at fenway park THEN traffic with 97% probability (97/100)

ELSE IF not rush hour THEN no traffic with 90% probability (474/523)

ELSE IF no rain & no construction THEN no traffic with 68% probability (329/482)

ELSE IF Friday THEN no traffic with 75% probability (30/40)

ELSE IF rain THEN traffic with 51% probability (452/892)

OTHERWISE no traffic with 66% probability (10/15)

(same form as expert systems)
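As the slide suggests, a rule list is exactly an if/else chain. A minimal sketch applying the traffic list above (the dictionary feature names are assumptions for illustration; the probabilities are the slide's training frequencies):

```python
# Hedged sketch: applying a learned rule list at prediction time.
# Feature names are illustrative; probabilities come from the slide
# and would normally be posterior estimates from the learned model.

def predict_traffic(x):
    """x: dict of boolean features. Returns (prediction, probability)."""
    if x["game_at_fenway"]:
        return ("traffic", 0.97)        # 97/100 training cases
    elif not x["rush_hour"]:
        return ("no traffic", 0.90)     # 474/523
    elif not x["rain"] and not x["construction"]:
        return ("no traffic", 0.68)     # 329/482
    elif x["friday"]:
        return ("no traffic", 0.75)     # 30/40
    elif x["rain"]:
        return ("traffic", 0.51)        # 452/892
    else:
        return ("no traffic", 0.66)     # 10/15
```

Each prediction comes with the rule that fired, which is what provides a reason for each prediction.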

Building Interpretable Classifiers with Rules and Bayesian Analysis

B. Letham, C. Rudin, T. McCormick, D. Madigan

- Winner of Data Mining Best Student Paper Competition, INFORMS 2013

- Winner of Student Paper Competition, Statistical Learning and Data Mining Section, American Statistical Association, 2014

Here's the model

IF diabetes and hypertension THEN stroke

ELSE IF age < 40 and Tylenol THEN no stroke

ELSE IF obese and hypertension THEN stroke

ELSE IF age > 50 and aspirin THEN no stroke

ELSE IF Betapace THEN no stroke

OTHERWISE predict stroke

[Diagram: data → mined rules → user-defined list. User determines the list length.]

- Restriction to discovered itemsets is the approximation

(avoids large NP hard problem)

- Might make the model more interpretable

- Complexity of model class is similar to decision trees

(interpretability is an illusion)

User determines width

User's preferred probabilities

Step 1:

Mine Frequent Itemsets

Step 2:

Learn Permutation of Itemsets and Probabilities for Rules

age<75 & hip fracture

age>75

past stroke & warfarin

diabetes & hypertension

history of falls & age > 60

IF age > 75 & days_in_hospital >10 THEN readmission risk is 31%

ELSE IF poor prognosis THEN readmission risk is 25%

ELSE IF pain scale > 8 THEN readmission risk is 13%

...

ELSE readmission risk is 1%
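Step 1 can be sketched as a brute-force frequent itemset miner; this is a stand-in for the Apriori/FP-growth-style mining actually used, and `min_support` and `max_len` are illustrative parameters:

```python
from itertools import combinations
from collections import Counter

def mine_itemsets(patients, min_support=0.05, max_len=2):
    """Return itemsets (tuples of features) appearing in at least a
    min_support fraction of patients. A minimal, brute-force stand-in
    for the pre-mining step of Bayesian Rule Lists."""
    counts = Counter()
    for features in patients:
        feats = sorted(set(features))
        for k in range(1, max_len + 1):
            for combo in combinations(feats, k):
                counts[combo] += 1
    n = len(patients)
    return [s for s, c in counts.items() if c / n >= min_support]
```

Restricting the model to these pre-mined itemsets is the approximation mentioned above: it avoids searching the full (NP-hard) rule space.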

Bayesian Rule Lists

Input:

Model:

Metropolis-Hastings Sampling:

- SWAP - switch two rules

- ADD - insert a rule

- REMOVE - delete a rule
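The three proposal moves can be sketched as follows. This is a minimal illustration; the real sampler also evaluates the posterior and the proposal ratio to accept or reject each move:

```python
import random

def propose(rule_list, candidate_rules):
    """One Metropolis-Hastings proposal over rule lists:
    SWAP two rules, ADD an unused candidate rule at a random
    position, or REMOVE a rule. Sketch only -- acceptance against
    the posterior is omitted."""
    new = list(rule_list)
    move = random.choice(["swap", "add", "remove"])
    if move == "swap" and len(new) >= 2:
        i, j = random.sample(range(len(new)), 2)
        new[i], new[j] = new[j], new[i]
    elif move == "add":
        unused = [r for r in candidate_rules if r not in new]
        if unused:
            new.insert(random.randrange(len(new) + 1), random.choice(unused))
    elif move == "remove" and new:
        new.pop(random.randrange(len(new)))
    return new
```

Each proposal changes the list length by at most one, which keeps the chain moving through neighboring rule lists.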

Goal: Measure stroke risk among patients with atrial fibrillation (Beat the CHADS2 Score)

>1000 times the amount of data

CHADS2 was originally developed using data from 1,733 Medicare beneficiaries.

5 features

5 people had the maximum score.

We started with 11.1 million Medicaid enrollees, took those with a diagnosis of atrial fibrillation, one-year of atrial fibrillation-free observation time prior to the diagnosis, and one year of observation time following the diagnosis (12,586 patients).

We considered all drugs and conditions (4,146)

2,200 pre-mined rules

Our current "CHADS2"

IF hemiplegia THEN stroke risk 58.9%

ELSE IF cerebrovascular disorder THEN stroke risk 47.8%

ELSE IF prior TIA THEN stroke risk 23.8%

ELSE IF occlusion/stenosis of carotid artery THEN stroke risk 16.4%

ELSE IF altered state of consciousness and age > 60 THEN stroke risk 16%

ELSE IF age <= 70 THEN stroke risk 4.6%

OTHERWISE stroke risk 8.7%

(Rudin, Letham, Madigan, JMLR 2013) has theorems about:

- complexity (VC dimension) of set of decision lists composed of if-then rules

- stability of decision lists (leading to performance guarantees)

Tic Tac Toe

Goal: Given a finished tic tac toe board, determine whether x has won the game

CART on Tic Tac Toe: accuracy = .8873 +/- .0061

C4.5 on Tic Tac Toe: accuracy = .9259 +/- .0066

Bayesian Rule Lists on Tic Tac Toe: accuracy = 1 +/- 0
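The ground-truth concept for this task is simple enough to write down exactly; a sketch of the labeling function, assuming a row-major 9-character board encoding with 'x', 'o', and 'b' for blank:

```python
def x_has_won(board):
    """board: 9 characters ('x', 'o', or 'b' for blank), row-major.
    True iff x occupies any of the 8 winning lines -- the concept
    a rule list can represent exactly, while greedy trees struggle."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    return any(all(board[i] == "x" for i in line) for line in lines)
```

The concept is a disjunction of 8 conjunctions, which matches the rule-list hypothesis space exactly.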

F. Wang and C. Rudin. AISTATS, 2015

- Winner of Student Paper Competition, Statistical Learning and Data Mining Section, American Statistical Association, 2015

Falling Rule Lists

Falling Rule Lists serve two purposes:

- Rank rules to form a predictive model

- Stratify patients into decreasing risk categories

Easier for decision makers to assess risks

The most at-risk patients should be handled with the highest priority.

FRLs are a form of decision list; standard decision tree methods do not produce them.

Not greedy, so computationally more difficult; the model above was created in 35 seconds.

User determines length

User determines preferences over rules

User preference over gaps between probabilities
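The defining constraint of a falling rule list is that the risk estimates decrease down the list. A minimal sketch of checking that property, using the readmission rules from the slide (the probabilities are the slide's values):

```python
def is_falling(rule_list):
    """A falling rule list requires risk estimates to be monotonically
    non-increasing down the list. Here the property is just checked;
    the actual method enforces it during learning."""
    probs = [p for _, p in rule_list]
    return all(a >= b for a, b in zip(probs, probs[1:]))

# Readmission example from the slides (intermediate rules elided).
readmit = [("age > 75 & days_in_hospital > 10", 0.31),
           ("poor prognosis", 0.25),
           ("pain scale > 8", 0.13),
           ("(default)", 0.01)]
```

Because the list is falling, the first rules automatically identify the highest-risk stratum, so decision makers can read priority straight off the model.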

The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification

B. Kim, C. Rudin, and J. Shah, NIPS 2014

This cluster is represented by...

...this quintessential observation (the prototype)...

...and here are the important characteristics (the subspace).

Computational Criminology

- crime pattern detection (subspace clustering)

- recidivism prediction (interpretable modeling)

Healthcare

The Latent State Hazard Model

(with Ramin Moghaddass)

Separates out the latent (internal) vulnerability from vulnerability due to external sources for interpretability

SCADA data from large Italian wind farm through partnership with ENEL/Accenture. Streaming measurements every 10 minutes from 44 sensors on each turbine.

[Plot: total hazard over time, decomposed into the latent degradation state and the influence of external factors]
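One plausible way to write the decomposition described above (the notation here is an assumption for illustration, not the paper's):

```latex
% Sketch (assumed notation): the hazard at time t splits additively into
% a latent internal-degradation term and a term driven by observed
% external covariates x(t).
\lambda(t) \;=\;
  \underbrace{\lambda_{\mathrm{latent}}(t)}_{\text{internal degradation}}
  \;+\;
  \underbrace{\lambda_{\mathrm{ext}}\bigl(x(t)\bigr)}_{\text{external factors}}
```

Separating the two terms is what makes the model interpretable: the latent term tracks the turbine's internal state, while the external term absorbs transient environmental stress.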

Energy Reliability

Combining Predictions with Decision-Making

(Machine Learning and Optimization)

Theja Tulabandhula and Cynthia Rudin. Machine Learning with Operational Costs. Journal of Machine Learning Research, Volume 14, pages 1989-2028, 2013.

Theja Tulabandhula and Cynthia Rudin. On Combining Machine Learning with Decision Making. Machine Learning, Volume 97, Issue 1-2, pp 33-64, 2014.

Theja Tulabandhula and Cynthia Rudin. Robust Optimization using Machine Learning for Uncertainty Sets. Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2014.

Theja Tulabandhula and Cynthia Rudin. Generalization Bounds for Learning with Linear, Polygonal, Quadratic, and Conic Side Knowledge. Machine Learning , 2014.

Internet and Marketing

mammographic mass dataset

- Each cluster has a prototype and a subspace - these are subspace clusters

Taco cluster represented by Mom's Tacos and subspace is "meat" and "spice"

- Each feature for each observation is generated from a mixture over subspace clusters

- Within the subspace of the cluster, data are generated to agree with the prototype

If we are generating a taco recipe, "meat" is more likely to be generated as "beef" and "spice" is more likely to be generated as "chili powder."

[Figure residue: accuracy comparison against random forests and SVM]

- People like to look at examples - case-based reasoning

- Bayesian Case Model is a prototype subspace clustering method

- Joint inference over prototypes for clusters, cluster assignments, and subspaces

Choose prototype for each cluster s

Choose subspace

Choose values for feature j

user control on size of subspace

If feature j is in the subspace, give higher probability to agreeing with the prototype's value on feature j.

Select a distribution over clusters

Select a cluster to generate feature j

Generate feature j of example i from the chosen cluster, and using the prototype to generate the outcome

user control on agreement with prototype
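The generative steps above can be sketched as follows; all names and the `agree` parameter are illustrative stand-ins, not the paper's notation:

```python
import random

def generate_example(prototypes, subspaces, feature_values, pi, agree=0.9):
    """Sketch of the BCM generative story for one observation:
    for each feature j, pick a cluster from mixture weights pi; copy
    the cluster prototype's value with probability `agree` when j is
    in that cluster's subspace, otherwise draw a value uniformly.
    `agree` stands in for the user-controlled agreement parameter."""
    x = []
    for j in range(len(feature_values)):
        s = random.choices(range(len(prototypes)), weights=pi)[0]
        if j in subspaces[s] and random.random() < agree:
            x.append(prototypes[s][j])   # agree with the prototype
        else:
            x.append(random.choice(feature_values[j]))  # free draw
    return x
```

With a single taco cluster whose subspace covers "meat" and "spice", generated recipes mostly copy the prototype's "beef" and "chili powder", exactly as described above.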

Summary

We should not ignore the use of predictive models when we create them.

Falling Rule Lists

Stratifies patients into decreasing risk categories within the predictive model

Machine Learning with Operational Costs

Take realistic prior knowledge into account about how the decision should turn out

Produce better prediction guarantees

user control on number of clusters per data point

experiments to verify that BCM outperforms LDA for classification performance (BCM doesn't have as strong independence assumptions)

human subject experiments to show BCM is more interpretable than LDA

Computational Criminology

Learning to Detect Patterns of Crime

Wang, Rudin, Wagner, Sevieri (ECML-PKDD, 2013)

A housebreak pattern we detected in Cambridge MA

Towards a Theory of Pattern Discovery

(Huggins and Rudin, SIAM Data Mining, 2014).

Finding Patterns with a Rotten Core: Data Mining for Crime Series with Core Sets

Wang, Rudin, Wagner, Sevieri. (Accepted with minor revision to Big Data, 2014)

Automated identification of crime series - subspace clustering (subspace is the M.O. of the criminal)

Supersparse Linear Integer Models

Collaboration with Mass General Hospital Sleep Lab

Personalized predictions of sleep apnea (OSA or UARS)


Cynthia Rudin

Computer Science and Artificial Intelligence Lab

Sloan School of Management

Massachusetts Institute of Technology

On Using Predictive Models for Decisions

* Funding provided by: NSF, Siemens, Ford, Wistron, ENEL, Accenture, MIT Lincoln Labs, MIT Energy Initiative

- Code is on my website

- currently less than 1 millisecond per iteration

Combines predictions with decisions

- code coming soon to website

[Figure: falling rule list risk probabilities, monotonically decreasing: .98, .96, .71, .53, .42, .30, .29]

[Figure labels: "Predict traffic" (Y/N), "Predict Readmission", "Itemsets not yet chosen"]

Box Drawings for Learning with Imbalanced Data

Siong Thye Goh and Cynthia Rudin, KDD 2014

- Classifier composed of axis-aligned rectangles

- Disjunction of Conjunctions

- Uses MIP and approximations
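The classifier structure described above can be sketched as point-in-box tests; the interval values below are hypothetical, not learned ones:

```python
def in_box(x, box):
    """box: dict mapping feature name -> (lo, hi) interval.
    True iff x falls inside every interval (a conjunction of
    axis-aligned threshold tests)."""
    return all(lo <= x[f] <= hi for f, (lo, hi) in box.items())

def box_classifier(x, boxes):
    """Classify positive iff x lies in any box: a disjunction (over
    boxes) of conjunctions (the per-axis tests). The boxes themselves
    are what the MIP learns; the values used here are hypothetical."""
    return any(in_box(x, b) for b in boxes)
```

Because each box is a readable conjunction of thresholds, the whole classifier stays interpretable even on imbalanced data.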

I want to make a decision

E.g., Given features about the patients scheduled for the clinic next week, how do I schedule staff?

(The decision problems might be coupled over unlabeled points.)

Overlooks the fact that there might be multiple equally good predictive models giving different costs.

[Cartoon: Bill and Peter. "Come on! I know it won't cost that much!"]

Machine Learning with Operational Costs

T. Tulabandhula and C. Rudin. Machine Learning with Operational Costs. Journal of Machine Learning Research, 2013.

Predict with decision in mind

Today...

Staffing for a medical clinic

[Formula residue: labels "predicted processing time for a station in past" and "total time clinic is open"]

There is a possibility that, even if we buy the most valuable three houses, our profit could be 6-7% worse than we would ordinarily predict.

[Figure residue: 3% and 16%, with label "predicted profit for house i"]

* data from US Navy medical clinic

* Boston housing data

* Bronx manhole data

Posterior of the opcost can't be computed

Not as complex as you might think

Examples coming...

Linear relaxations of the optimization problems do this for us.

Summary for ML with Operational Costs

- new framework for learning theory / decision theory, incorporating prior knowledge of the operational cost

- better prediction guarantees when practical knowledge about decision-making is incorporated into predictions

Falling Rule Lists

(Wang and Rudin, AISTATS, 2015)

Machine Learning with Operational Costs

(Tulabandhula and Rudin, JMLR, 2013)

Predict → Decide

Decide → Predict


n=7944, p=43, rules mined=103

time=33 sec, 6.6 milliseconds per iteration on a laptop

Goal: beat CART

Produce sparse predictive models that can fit on an index card.

Other benefits

- robust to outliers

- little to no parameter tuning, no annoying nested cross-validation

- they have meaningful parameters

- they provide reasons for each prediction

What kind of prior knowledge do managers have?

C4.5

CART

(Making Predictions Closer to Decisions)

(Knowledge About Decisions Can Help With Predictions)

Proof Outline

Prediction Analysis Lab Current Projects

Sparse logical models (FRL is one of them) and applications to healthcare

New decision theories (ML with OpCosts is one of them)

Energy reliability (hazard modeling, point process modeling)

Criminology (clustering)

ML spin on classic operations management problems (consumer choice models, multi-armed bandits, the newsvendor problem)

Causal inference, machine learning style

Express the point as a convex combination of the p points with coefficients gamma, then draw K points i.i.d. according to gamma. The expected squared approximation error of their average is at most a constant over K, so at least one realization achieves a good 1/K-rate approximation.
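The counting argument above can be written out as a standard Maurey-type sketch; the constant c and the norm are assumptions here, stated only at the level of the slide:

```latex
% z = \sum_{i=1}^{p} \gamma_i x_i with \gamma in the probability simplex.
% Draw V_1,\dots,V_K i.i.d. with \Pr[V_k = x_i] = \gamma_i, so \mathbb{E}[V_k] = z.
\mathbb{E}\Bigl\| z - \tfrac{1}{K}\textstyle\sum_{k=1}^{K} V_k \Bigr\|^2
  \;=\; \tfrac{1}{K}\,\mathbb{E}\|V_1 - z\|^2 \;\le\; \tfrac{c}{K}
% Since the expectation is at most c/K, some realization of the K draws
% achieves error at most c/K: a sparse K-term approximation exists.
```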

Theja Tulabandhula and Cynthia Rudin. On Combining Machine Learning with Decision Making. Machine Learning, Volume 97, Issue 1-2, pp 33-64, 2014.

- Bound of a more geometric flavor for the same problem, volumes of spherical caps

Theja Tulabandhula and Cynthia Rudin. Generalization Bounds for Learning with Linear, Polygonal, Quadratic, and Conic Side Knowledge. Machine Learning, 2014.

- Handles very general constraints (polygonal, quadratic, conic) on the unlabeled points (all corresponding to real situations)

On Interpretable Classification Models for Recidivism Prediction

Jiaming Zeng, Berk Ustun, Cynthia Rudin

Goal: Predict whether someone released from prison will be rearrested (and for what type of crime) within 3 years

Largest public dataset: 33,796 useful records with full criminal history, age, etc. (~50 features)

used for bail/sentencing/allocation of social programs/reentry programs

large class of models performs approx equally well

- Code is on my website

Falling Rule Lists (With All The Notation)

[Equation residue: the classification model, prior, and likelihood; the mathematical notation was lost in extraction]

Readmissions led to $41.3B in additional hospital costs in 2011 (Agency for Healthcare Research and Quality)

E.g., see Dunson, Bayesian Isotonic Regression for Discrete Outcomes, 2004

Algorithm for Falling Rule Lists coming up.

Classic operations research problems: scheduling, routing, facility location, knapsack...

Current Work

Hierarchical Bayesian Modeling and Applications to Energy, Healthcare, Criminology

Point process modeling for manhole event predictions (AOAS, 2015 - with Tyler)

Hazard modeling with latent degradation states for wind turbine failure prediction

Modeling of recovery curves, with application to prostatectomy (with Tyler)

Machine Learning Methods / Sparse Logical Models

"The Bayesian Case Model" (NIPS 2014)

"Box Drawings for Imbalanced Data" (KDD 2014)

"Supersparse Integer Linear Models" for recidivism prediction & cognitive impairment diagnosis

"Bayesian Or's of And's" with application to marketing

"Bayesian Rule Lists" (with Tyler and David Madigan)

Causal Inference

Robust matched pairs for natural experiments (posted yesterday)

Hierarchical Bayesian "Self Controlled Case Series" (with David Madigan)

- one of a class of related papers

"Planning with preference" posed by F. Malucelli, Politecnico di Milano.

Cynthia Rudin, David Waltz, Roger N. Anderson, Albert Boulanger, Ansaf Salleb-Aouissi, Maggie Chow, Haimonti Dutta, Philip Gross, Bert Huang, Steve Ierome, Delfina Isaac, Arthur Kressner, Rebecca J. Passonneau, Axinia Radeva, Leon Wu. Machine Learning for the New York City Power Grid. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, No. 2, February 2012.

Quadratic Prior Knowledge

Fung et al. 2002, James et al. 2014, Nguyen and Caruana 2008

- knowledge about classes of unlabeled points (polygonal)

Gomez-Chova et al. 2008

- SVM with Laplacian regularization on unlabeled points (quadratic)

Shivaswamy et al. 2006

- robust SVM (conic)

proof uses convex duality

Conic Prior Knowledge

In The Past

Theoretical foundations and models for ranking in machine learning

Convergence properties of stepwise coordinate descent methods (boosting), connections to dynamical systems

Theoretical foundations for sparse logical rule-based models

Volume of Sphere without Spherical Cap

sharp cones -> high eigenvalues -> smaller bound

NRC report

BJA

scaling

(MIP version also exists)