Loading presentation...

Present Remotely

Send the link below via email or IM


Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.


Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks


No description

Cynthia Rudin

on 25 July 2016

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of FRL_OpCostMar2015

Bayesian Rule Lists
Falling Rule Lists
Goal: Predict 30 day readmissions

Data from Mass General Hospital collaborators:
- ~8,000 patients who had no prior history of readmissions
- Features are very detailed, e.g., "impaired mental status," "difficult behavior," "chronic pain," "feels unsafe," etc.
- Labels for each patient, whether readmitted or not
- List size parameter set at 8 rules, no parameter tuning
Bayesian Rule Lists
The Bayesian Case Model
Other Models
"This simplicity gets at the important issue: A decent transparent model that is actually used will outperform a sophisticated system that predicts better but sits on a shelf. If the researchers had created a model that predicted well but was more complicated, the LAPD likely would have ignored it, thus defeating the whole purpose."

--- Greg Ridgeway, Director of National Institute of Justice
LAPD Recruits Model
Goal: Beat CART (Classification and Regression Trees)
C4.5 Quinlan 1993
CART Breiman 1984
Cart is the "go to" method in industry for interpretable classification.

Works well but is greedy.

Sacrifices both accuracy and interpretability
rush hour?
no traffic
no traffic
no traffic
I hypothesize that many real problems are
This is flat
Classification Error
If I'm right, there exist many very sparse yet very accurate models. This means we can find these models if we look hard enough.

If I'm right, there might be many different decisions corresponding to the range of reasonable models.

Related to the Rashomon Effect of Breiman
Example: Predict whether I'll get stuck in traffic
IF game at fenway park THEN traffic with 97% probability (97/100)
ELSE IF not rush hour THEN no traffic with 90% probability (474/523)
ELSE IF no rain & no construction THEN no traffic with 68% prob (329/482)
ELSE IF Friday THEN no traffic with 75% probability (30/40)
ELSE IF rain THEN traffic with 51% probability (452/892)
OTHERWISE no traffic with 66% probability (10/15)
(same form as expert systems)
Building Interpretable Classifiers with Rules and Bayesian Analysis
B. Letham, C. Rudin, T. McCormick, D. Madigan
- Winner of Data Mining Best Student Paper Competition, INFORMS 2013
- Winner of Student Paper Competition, Statistical Learning and Data Mining Section, American Statistical Association, 2014
Here's the model
If diabetes and hypertension ----> stroke
Else If age < 40 and tylenol ----> no stroke
Else If obese and hypertension ---> stroke
Else If age >50 and aspirin -----> no stroke
Else if Betapace -----> no stroke
Otherwise predict stroke
User determines length

- Restriction to discovered itemsets is the approximation
(avoids large NP hard problem)
- Might make the model more interpretable
- Complexity of model class is similar to decision trees
(interpretability is an illusion)
User determines width
User's preferred probabilties
Step 1:
Mine Frequent Itemsets
Step 2:
Learn Permutation of Itemsets and Probabilities for Rules
age<75 & hip fracture
past stroke & warfarin
diabetes & hypertension
history of falls & age > 60
IF age > 75 & days_in_hospital >10 THEN readmission risk is 31%
ELSE IF poor prognosis THEN readmission risk is 25%
ELSE IF pain scale > 8 THEN readmission risk is 13%
: : : :
ELSE readmission risk is 1%
Bayesian Rule Lists
Metropolis-Hastings Sampling:
- SWAP - switch two rules
- ADD - insert a rule
- REMOVE - delete a rule
Goal: Measure stroke risk among patients with atrial fibrillation (Beat the CHADS2 Score)
>1000 times the amount of data
CHADS2 was originally developed using data from 1,733 Medicare beneficiaries.
5 features

5 people had the maximum score.

We started with 11.1 million Medicaid enrollees, took those with a diagnosis of atrial fibrillation, one-year of atrial fibrillation-free observation time prior to the diagnosis, and one year of observation time following the diagnosis (12,586 patients).

We considered all drugs and conditions (4,146)

2,200 pre-mined rules
Our current "CHADS2"
If Hemiplegia stroke risk 58.9%
Elseif Cerebrovascular disorder stroke risk 47.8%
Elseif Prior TIA stroke risk 23.8%
Elseif Occlusion/stenosis of carotid artery stroke risk 16.4%
Elseif Altered state of consciousness
and age > 60 stroke risk 16%
Elseif age <= 70 stroke risk 4.6%
OTHERWISE stroke risk 8.7%
(Rudin, Letham, Madigan, JMLR 2013) has theorems about:
- complexity (VC dimension) of set of decision lists composed of if-then rules
- stability of decision lists (leading to performance guarantees)

Tic Tac Toe
Goal: Given a finished tic tac toe board, determine whether x has won the game
=.8873 +/- .0061
CART on Tic Tac Toe
C4.5 on Tic Tac Toe
=.9259 +/- .0066
=.8873 +/- .0061
Bayesian Rule Lists on Tic Tac Toe
=1 +/- 0
F. Wang and C. Rudin. AISTATS, 2015
- Winner of Student Paper Competition, Statistical Learning and Data Mining Section, American Statistical Association, 2015

Falling Rule Lists
Falling Rule Lists serve two purposes:
- Rank rules to form a predictive model
- Stratify patients into decreasing risk categories
Easier for decision makers to assess risks
The most at-risk patients should be handled with the highest priority.
FRL are a type of decision tree. Decision tree methods do not produce FRL.
Not greedy, computationally more difficult. Model above created in 35 seconds.
User determines length
User determines preferences over rules
User preference over gaps between probabilities
The Bayesian Case Model: A Generative Approach for
Case-Based Reasoning and Prototype Classification
B. Kim, C. Rudin, and J. Shah, NIPS 2014
This cluster is represented by...
...this quintessential observation (the prototype)...
...and here are the important characteristics (the subspace).
Computational Criminology
- crime pattern detection (subspace clustering)
- recidivism prediction (interpretable modeling)
The Latent State Hazard Model

(with Ramin Moghaddass)

Separates out the latent (internal) vulnerability from vulnerability due to external sources for interpretability

SCADA data from large Italian wind farm through partnership with ENEL/Accenture. Streaming measurements every 10 minutes from 44 sensors on each turbine.
total hazard
latent degradation state
influence of external factors
Energy Reliability
Combining Predictions with Decision-Making
(Machine Learning and Optimization)
Theja Tulabandhula and Cynthia Rudin. Machine Learning with Operational Costs. Journal of Machine Learning Research, Volume 14, pages 1989-2028, 2013.

Theja Tulabandhula and Cynthia Rudin. On Combining Machine Learning with Decision Making. Machine Learning, Volume 97, Issue 1-2, pp 33-64, 2014.

Theja Tulabandhula and Cynthia Rudin. Robust Optimization using Machine Learning for Uncertainty Sets. Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2014.

Theja Tulabandhula and Cynthia Rudin. Generalization Bounds for Learning with Linear, Polygonal, Quadratic, and Conic Side Knowledge. Machine Learning , 2014.
Internet and Marketing
mammographic mass dataset
- Each cluster has a prototype and a subspace - these are subspace clusters
Taco cluster represented by Mom's Tacos and subspace is "meat" and "spice"
- Each feature for each observation is generated from a mixture over subspace clusters
- Within the subspace of the cluster, data are generated to agree with the prototype
If we are generating a taco recipe, "meat" is more likely to be generated as "beef" and
"spice" is more likely to be generated as "chili powder."
random forests
- People like to look at examples - case-based reasoning
- Bayesian Case Model is a prototype subspace clustering method
- Joint inference over prototypes for clusters, cluster assignments, and subspaces
Choose prototype for each cluster s
Choose subspace
Choose values for feature j
user control on size of subspace
If feature j is in the subspace, give higher probability to agreeing with the prototype on feature j (which takes value ).
Select a distribution over clusters
Select a cluster to generate feature j
Generate feature j of example i from the chosen cluster, and using the prototype to generate the outcome
user control on agreement with prototype
We should not ignore the use of predictive models when we create them.

Falling Rule Lists
Stratifies patients into decreasing risk categories within the predictive model

Machine Learning with Operational Costs
Take realistic prior knowledge into account about how the decision should turn out
Produce better prediction guarantees
user control on number of clusters per data point
experiments to verify that BCM outperforms LDA for classification performance (BCM doesn't have as strong independence assumptions)
human subject experiments to show BCM is more interpretable than LDA
Computational Criminology
Learning to Detect Patterns of Crime
Wang, Rudin, Wagner, Sevieri (ECML-PKDD, 2013)
A housebreak pattern we detected in Cambridge MA
Towards a Theory of Pattern Discovery
(Huggins and Rudin, SIAM Data Mining, 2014).
Finding Patterns with a Rotten Core: Data Mining for Crime Series with Core Sets
Wang, Rudin, Wagner, Sevieri. (Accepted with minor revision to Big Data, 2014)
Automated identification of crime series - subspace clustering (subspace is the M.O. of the criminal)
Supersparse Linear Integer Models
Collaboration with Mass General Hospital Sleep Lab
Personalized predictions of sleep apnea (OSA or UARS)
Cynthia Rudin
Computer Science and Artificial Intelligence Lab
Sloan School of Management
Massachusetts Institute of Technology
On Using Predictive Models for Decisions
* Funding provided by: NSF, Siemens, Ford, Wistron, ENEL, Accenture, MIT Lincoln Labs, MIT Energy Initiative
- Code is on my website
- currently less than 1 millisecond per iteration
Combines predictions with decisions
- code coming soon to website




Predict traffic
Predict Readmission
Itemsets not yet chosen
Box Drawings for Learning with Imbalanced Data
Siong Thye Goh and Cynthia Rudin, KDD 2014
- Classifier composed of axis-aligned rectangles
- Disjunction of Conjuctions
- Uses MIP and approximations
I want to make a decision
E.g., Given features about the patients scheduled for the clinic next week, how do I schedule staff?

(The decision problems might be coupled over unlabeled points.)

Overlooks the fact that there might be multiple equally good predictive models giving different costs.

Come on! I know it won't cost that much!
Machine Learning with Operational Costs
T. Tulabandhula and C. Rudin.
Machine Learning with Operational Costs
Journal of Machine Learning Research, 2013

Predict with decision in mind
predicted processing time for a station in past

total time clinic is open

Staffing for a medical clinic
There is a possibility that, even if we buy the most valuable three houses, our profit could be 6-7% worse than we would ordinarily predict.
3% 16%

predicted profit for house i
* data from US Navy medical clinic
* Boston housing data
* Bronx manhole data
Posterior of the opcost can't be computed
Not as complex as you might think
Examples coming...
Linear relaxations of the optimization problems do this for us.
Summary for ML with Operational Costs

- new framework for learning theory /decision theory, incorporates prior knowledge on operational cost

- better prediction guarantees when practical knowledge about
is incorporated into

Falling Rule Lists
(Wang and Rudin, AISTATS, 2015)

Machine Learning with Operational Costs
(Tulabandhula and Rudin, JMLR, 2013)
Predict Decide
Decide Predict
Predict Decide
Decide Predict
n=7944, p=43, rules mined=103
time=33 sec, 6.6 milliseconds per iteration on a laptop

Goal: beat CART

Produce sparse predictive models that can fit on an index card.

Other benefits
- robust to outliers
- little to no parameter tuning, no annoying nested cross-validation
- they have meaningful parameters
- they provide reasons for each prediction
What kind of prior knowledge do managers have?
(Making Predictions Closer to Decisions)
(Knowledge About Decisions Can Help With Predictions)
Proof Outline
Prediction Analysis Lab Current Projects
Sparse logical models (FRL is one of them) and applications to healthcare
New decision theories (ML with OpCosts is one of them)
Energy reliability (hazard modeling, point process modeling)
Criminology (clustering)
ML spin on classic operations management problems (consumer choice models, multi-armed bandits, the newsvendor problem)
Causal inference, machine learning style
Express the point as a linear combination of the p points with coefficients gamma, draw K points at random from gamma. The expectation of doing this is less than 1/K so there is at least one good 1/K approximation.
Theja Tulabandhula and Cynthia Rudin. On Combining Machine Learning with Decision Making. Machine Learning, Volume 97, Issue 1-2, pp 33-64, 2014.

- Bound of a more geometric flavor for the same problem, volumes of spherical caps
Theja Tulabandhula and Cynthia Rudin. Generalization Bounds for Learning with Linear, Polygonal, Quadratic, and Conic Side Knowledge. Machine Learning, 2014.

- Handles very general constraints (polygonal, quadratic, conic) on the unlabeled points (all corresponding to real situations)
On Interpretable Classification Models for Recidivism Prediction
Jiaming Zeng, Berk Ustun, Cynthia Rudin
Goal: Predict whether someone released from prison will be rearrested (and for what type of crime) within 3 years

Largest public dataset, 33,796 useful records, full criminal history, age, etc.,
~50 features
used for bail/sentencing/allocation of social programs/reentry programs
large class of models performs approx equally well
- Code is on my website
Falling Rule Lists (With All The Notation)
Readmissions led to $41.3B in additional hospital costs in 2011 (Agency for Healthcare Research and Quality)
E.g., see Dunson, Bayesian Isotonic Regression for Discrete Outcomes, 2004
Algorithm for Falling Rule Lists coming up.

Classic operations research problems have
scheduling, routing, facility location, knapsack...
Current Work
Hierarchical Bayesian Modeling and Applications to Energy, Healthcare, Criminology
Point process modeling for manhole event predictions (AOAS, 2015 - with Tyler)
Hazard modeling with latent degradation states for wind turbine failure prediction
Modeling of recovery curves, with application to prostatectomy (with Tyler)
Machine Learning Methods / Sparse Logical Models
"The Bayesian Case Model" (NIPS 2014)
"Box Drawings for Imbalanced Data" (KDD 2014)
"Supersparse Integer Linear Models" for recidivism prediction & cognitive impairment diagnosis
"Bayesian Or's of And's" with application to marketing
"Bayesian Rule Lists" (with Tyler and David Madigan)
Causal Inference
Robust matched pairs for natural experiments (posted yesterday)
Hierarchical Bayesian "Self Controlled Case Series" (with David Madigan)
- one of a class of related papers
"Planning with preference" posed by F. Malucelli, Politecnico di Milano.
Cynthia Rudin, David Waltz, Roger N. Anderson, Albert Boulanger, Ansaf Salleb-Aouissi, Maggie Chow, Haimonti Dutta, Philip Gross, Bert Huang, Steve Ierome, Delfina Isaac, Arthur Kressner, Rebecca J. Passonneau, Axinia Radeva, Leon Wu.
Machine Learning for the New York City Power Grid.
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, No 2. February 2012.
Prior Knowledge
Fung et al. 2002, James et al. 2014, Ngyugen and Caruana 2008
- knowledge about classes of unlabeled points (polygonal)
Gomez-Chova et al. 2008
- SVM with laplacian regularization on unlabeled points (quadratic)
Shivaswamy et al. 2006
- robust SVM (conic)
proof uses convex duality
Prior Knowledge
In The Past
Theoretical foundations and models for ranking in machine learning
Convergence properties of stepwise coordinate descent methods (boosting), connections to dynamical systems
Theoretical foundations for sparse logical rule-based models
Volume of Sphere without Spherical Cap
sharp cones -> high evalues -> smaller bound
NRC report
(MIP version also exists)
Full transcript