Data Science at Kreditech
Jose Garcia Moreno-Torres
Prelude: tonight's menu
The real-world issues no one told me about in school
Main course A: Binary (yes/no) classification problem whose solution created a whole subset of problems.
Main course B: Pair of regression problems where data is missing left and right
Dessert: Short overview of other problems we work on
All sprinkled with a brief look at the history of Data Science at Kreditech
It's a crazy world
... and he found it was BAD. He investigated and investigated, only to find that his data contained too much information: information that was not available in the production environment.
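One way to guard against this kind of leakage is to restrict training features to those known to exist at decision time. A minimal sketch, assuming a hand-maintained set of production-available features; every feature name below is hypothetical and not taken from the talk:

```python
# Hypothetical feature names; the set of production-available features is an
# assumption that would have to be maintained alongside the scoring service.
PRODUCTION_FEATURES = {"age", "requested_amount", "device_type"}


def split_leaky_features(training_columns, production_features=PRODUCTION_FEATURES):
    """Return (usable, leaky) feature lists, preserving the input order."""
    usable = [c for c in training_columns if c in production_features]
    leaky = [c for c in training_columns if c not in production_features]
    return usable, leaky


training_columns = ["age", "requested_amount", "days_past_due", "device_type"]
usable, leaky = split_leaky_features(training_columns)
print("usable:", usable)
print("leaky: ", leaky)
```

Here `days_past_due` is flagged because it is only known after a loan has been granted, so a model trained on it would look good offline and fail in production.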
Main course A: Credit Scoring
Core problem at Kreditech
Classic step-by-step approach:
Data cleaning and preparation
Feature selection and construction
The only performance that matters is the one you achieve in the production environment
Our step-by-step approach:
– Data cleaning and preparation
– Evaluation metric definition
– Feature selection and construction
(Before we solve a problem, we need to understand what we are solving)
Simple binary classification problem
Why is it complex?
– Reject inference: Only get feedback on “yes” cases
– Feedback delay: Give a loan today, learn whether it was the right decision in a few months
– Unstable conditions: volatile environment (changes in website, for example, or technical errors) and never-seen seasonality (Christmas)
How to evaluate? Statistics vs Business goal
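To illustrate how a statistical metric and a business goal can disagree, here is a toy sketch that compares accuracy with profit at two acceptance thresholds. The scores, outcomes, and the gain and loss amounts are all made-up assumptions, not figures from the talk:

```python
def accuracy(scores, repaid, threshold):
    """Share of applicants classified correctly when accepting score >= threshold."""
    correct = sum((s >= threshold) == bool(r) for s, r in zip(scores, repaid))
    return correct / len(scores)


def profit(scores, repaid, threshold, gain_per_repaid=20.0, loss_per_default=100.0):
    """Total profit over accepted applicants; the gain and loss values are made up."""
    total = 0.0
    for s, r in zip(scores, repaid):
        if s >= threshold:
            total += gain_per_repaid if r else -loss_per_default
    return total


# Toy data: model scores (higher = more likely to repay) and observed outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
repaid = [1, 1, 1, 0, 1, 1, 0, 0]

for threshold in (0.35, 0.65):
    print(threshold, accuracy(scores, repaid, threshold), profit(scores, repaid, threshold))
```

On this toy data the lower threshold is the more accurate one, yet it earns less, because a single default costs as much as five repaid loans bring in.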
Limit estimation: How much money should I lend?
Goal: Maximize amount, minimize defaults, minimize number of customers who leave
Main course B: Twin regression problems
Pricing: How much should I charge this customer?
Goal: Maximize price, minimize number of customers who leave
If customer A accepted to pay 10%, would he have also accepted 20%?
If customer B left because he found 20% too expensive, would he have taken the offer at
Squeeze the data
(customer A almost surely would have accepted 8% -> extra data point)
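The "extra data point" idea can be sketched as a data augmentation step. This assumes price responses are monotone: whoever accepted a price would also have accepted any lower one, and whoever declined would also have declined any higher one. The price grid and the function name are illustrative assumptions:

```python
def augment_offers(observations, price_grid):
    """Add implied (price, accepted) points under a monotonicity assumption.

    Assumption: a customer who accepted a price would also have accepted any
    lower price, and one who declined would also have declined any higher one.
    """
    augmented = []
    for customer, price, accepted in observations:
        augmented.append((customer, price, accepted, "observed"))
        for p in price_grid:
            if accepted and p < price:
                augmented.append((customer, p, True, "implied"))
            elif not accepted and p > price:
                augmented.append((customer, p, False, "implied"))
    return augmented


observations = [("A", 10, True), ("B", 20, False)]  # prices in percent
price_grid = [8, 10, 15, 20, 25]
for row in augment_offers(observations, price_grid):
    print(row)
```

Since the implied points are only "almost surely" true, keeping the `"implied"` tag makes it possible to give them a lower weight than observed points when fitting a model.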
Active learning: choose amounts or prices that would help you learn
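As a rough illustration of choosing prices that help you learn, the sketch below picks the price whose acceptance rate is least well known, measured by the variance of a Beta posterior with a uniform prior. The talk does not say which selection rule was used; this rule and the counts are assumptions:

```python
def beta_variance(accepts, declines):
    """Variance of a Beta(accepts + 1, declines + 1) posterior (uniform prior)."""
    a, b = accepts + 1, declines + 1
    return (a * b) / ((a + b) ** 2 * (a + b + 1))


def most_informative_price(history):
    """Pick the price whose acceptance rate is currently most uncertain."""
    return max(history, key=lambda price: beta_variance(*history[price]))


# Hypothetical counts per price (in percent): (accepted, declined).
history = {10: (40, 5), 15: (12, 10), 20: (1, 2)}
print(most_informative_price(history))
```

This rule is pure exploration; in practice it would have to be balanced against the revenue lost by offering prices that are not expected to be the best ones.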
Other applications (saving something for the next talk)
Loan limit model for recurring customers:
Extra information = extra data sources
Choose the right debt collection strategy:
Several options available, pick one to apply to a customer, wait a few days, pick another one
Multiple instance learning, active learning
Customer Lifetime Value estimation:
Time series modeling, distance-based methods
In the beginning, there was only data.
The unsuspecting data scientist, freshly hired into a tech startup, built a model, and he saw the performance was good.
He was satisfied, so he deployed the model...
Reject inference: Extend acceptance past optimal operations to learn from further cases
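One simple way to extend acceptance past the usual cut-off is to accept a small random share of applicants who would otherwise be rejected. This is a sketch of that idea only; the threshold, the exploration rate, and the function name are assumptions, and the talk does not describe how the extension was actually done:

```python
import random


def decide(score, threshold, explore_rate, rng):
    """Return (accept, explored) for one applicant.

    Applicants at or above the threshold are accepted; a random share of the
    rest is accepted as well so that their repayment outcome gets observed.
    """
    if score >= threshold:
        return True, False
    if rng.random() < explore_rate:
        return True, True
    return False, False


rng = random.Random(0)  # fixed seed so the run is repeatable
scores = [rng.random() for _ in range(1000)]
decisions = [decide(s, threshold=0.6, explore_rate=0.05, rng=rng) for s in scores]
explored = sum(1 for accept, was_explored in decisions if was_explored)
print("accepted below threshold:", explored)
```

Storing the `explored` flag with each loan lets later analysis tell exploratory acceptances apart from regular ones.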
Feedback delay: Heuristically approximate future performance
Unstable conditions:
a) Build models resilient to them (high regularization)
b) Try to detect shifts in data distribution
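A common way to detect a shift in a feature's distribution is the Population Stability Index (PSI), which compares the binned distribution seen at training time with the one seen in production. The talk does not name a method, so PSI is a swapped-in example here; the bin edges and the sample data are assumptions:

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""
    def shares(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                last = i == len(counts) - 1
                if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                    counts[i] += 1
                    break
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
training = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
same = list(training)
shifted = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.85, 0.95]
print(psi(training, same, edges))
print(psi(training, shifted, edges))
```

Identical samples give a PSI of zero, and the value grows as the production sample drifts away from the training sample. Values falling outside the bin edges are not counted in any bin in this sketch.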