Understanding and predicting customer behavior with R and Causata

For more information, visit http://causata.com. This presentation was delivered to the Bay Area useR Group (BARUG) on May 14, 2013

Justin Hemann

on 31 October 2014

Transcript of Understanding and predicting customer behavior with R and Causata

Understanding and predicting customer behavior
with R and Causata

Bay Area useR Group (BARUG) Meeting
May 14, 2013

Justin Hemann & Robert Frankus
The challenge:
Suppose that your company has many thousands or millions of customers. You want to give your customers a more personalized experience.

What do you do?
The essentials
Connect data across touch points to create a single view of the customer
Create rich customer profiles to understand and predict customer needs without rigid ETL processes
Deliver a more personalized experience across any touch point -- the web, call center, stores, etc.
Turning into
To understand customers, you have to connect their data in time order
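A minimal sketch of connecting data in time order, assuming each touch point delivers events as (customer_id, timestamp, event_type) tuples. The function name build_timelines and the sample events are hypothetical, not part of Causata.

```python
from collections import defaultdict
from datetime import datetime


def build_timelines(*sources):
    """Merge events from several touch points into one time-ordered list per customer."""
    timelines = defaultdict(list)
    for source in sources:
        for customer_id, timestamp, event_type in source:
            timelines[customer_id].append((timestamp, event_type))
    for events in timelines.values():
        events.sort()  # tuples sort by timestamp first
    return dict(timelines)


# Hypothetical events from two touch points.
web_events = [
    ("c1", datetime(2013, 5, 1, 9, 0), "web_page_view"),
    ("c1", datetime(2013, 5, 3, 14, 30), "web_click"),
    ("c2", datetime(2013, 5, 2, 8, 0), "web_page_view"),
]
call_center_events = [
    ("c1", datetime(2013, 5, 2, 11, 15), "support_call"),
]

timelines = build_timelines(web_events, call_center_events)
c1_event_types = [event_type for _, event_type in timelines["c1"]]
```

Here the support call lands between the two web events for customer c1, which is the single time-ordered view the slide describes.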
What data should you collect and connect?
Prioritize data that represents customer intent or initiative
Build variables over multiple timescales
Intent / initiative (left):
Product use
Feature use
Web content viewed
Web clicks
Email opens
Support calls / emails
Ad / offer clicks

Not intent / initiative (right):
Email sends
Ad impressions
'Slot car' interactions where a customer cannot deviate from a prescribed path
The data on the left will generally be much more predictive than the data on the right

Demographics (age, ZIP code, etc.) may be strong predictors, but generally the intent / initiative variables are stronger
Using R to evaluate variables
Suppose you have several thousand variables -- how do you sort out which are predictive? Where should you focus your effort to create new variables?
The answer depends on the problem (classification, regression, etc.)
The Causata R package includes a function to quickly evaluate variables for binary classification -- BinaryPredictor()
The algorithm is inspired by a KDD cup winner. See http://www.touri.mtome.com/Publications/CiML/CiML-v3-book.pdf, p. 45
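To illustrate the general idea of screening variables one at a time for binary classification, here is a rough sketch that ranks variables by single-variable AUC. This is not the BinaryPredictor() algorithm, whose details are not given here; the functions auc and rank_variables and the sample data are assumptions made for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def rank_variables(columns, labels):
    """Sort variable names by how far their AUC is from 0.5 (no discrimination)."""
    strength = {name: abs(auc(values, labels) - 0.5) for name, values in columns.items()}
    return sorted(strength, key=strength.get, reverse=True)


# Hypothetical data: 1 = customer responded, 0 = did not.
labels = [0, 0, 1, 1, 0, 1]
columns = {
    "emails_sent": [3, 4, 3, 4, 3, 4],
    "page_views_last_week": [1, 0, 5, 7, 2, 6],
}
ranking = rank_variables(columns, labels)
```

The pairwise loop is quadratic in the number of rows, which is fine for a sketch but too slow for millions of customers; a rank-based AUC would scale better.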
Don't overfit!
It's common to have over 1000 potential variables, so spurious correlations are likely
Use cross validation, regularization, etc.
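The risk can be made concrete with a small simulation, sketched here under the assumption of 1000 pure-noise variables and 200 customers (all numbers are invented). The noise variable that best matches the target on the training rows looks predictive there, and a holdout check is what exposes it.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n_train, n_holdout, n_variables = 100, 100, 1000
n_rows = n_train + n_holdout

# The target and every variable are independent noise: nothing is truly predictive.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
noise = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_variables)]

# Pick the variable that correlates best with the target on the training rows only.
train_corr = [abs(pearson(v[:n_train], target[:n_train])) for v in noise]
best = max(range(n_variables), key=train_corr.__getitem__)
best_train_corr = train_corr[best]

# Re-check the same variable on rows that played no part in the selection.
holdout_corr = abs(pearson(noise[best][n_train:], target[n_train:]))
```

Comparing best_train_corr with holdout_corr shows how much of the apparent signal came from selecting among many candidates.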
Most CXM problems are framed in terms of binary classification
Popular algorithms include logistic regression and trees
Built-in variable selection is essential
R package favorites:
party - trees and random forests
Unlike most tree algorithms, conditional inference trees are not biased to select categorical variables with many levels
Another nice ensemble tree package: gbm
Elastic net regularization, e.g. via the glmnet package
Combination of lasso / ridge regression, aka L1 and L2 norm regularization
Groups / selects correlated variables
Very fast
caret - the Classification And REgression Training package includes tools for training and tuning models
Simplifies and manages the task of building many models with various combinations of tuning parameters
Makes parallelized cross validation easy
5 axioms for Customer Experience Management (CXM)
Build data around customers
Essential for capturing intent and interest

Build data around focal points
Essential for determining cause and effect

Don't overfit!
Cross validation, regularization, permutation tests, etc.

Move algorithms to data, not data to algorithms
Moving data is slow, moving algorithms is fast

Real-time interactions require real-time data
Examples: web, call centers, triggered emails

Build data around customers
Store data as events
Prioritize data that represents customer intent and events triggered by customer initiative
You may have hundreds or thousands of variables -- use R to help sort out which are useful
Store data as events
Events are aggregated to produce variables
This arrangement gives you tremendous flexibility
Store raw data and extract features / variables as needed
ETL can be executed at query time, not in batch
Events make it easy to reconstruct a customer profile as it looked in the past (more on this later)
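A minimal sketch of the event-based arrangement, assuming events are stored as (customer_id, timestamp, event_type) tuples. The function profile_as_of and the sample events are hypothetical. Variables are computed from raw events at query time, so the same function can rebuild a profile as of any past date.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw event store.
events = [
    ("c1", datetime(2013, 1, 5), "page_view"),
    ("c1", datetime(2013, 2, 10), "page_view"),
    ("c1", datetime(2013, 3, 1), "purchase"),
    ("c2", datetime(2013, 2, 20), "support_call"),
]


def profile_as_of(events, customer_id, as_of):
    """Count events by type for one customer, using only events before as_of."""
    return Counter(
        event_type
        for cid, timestamp, event_type in events
        if cid == customer_id and timestamp < as_of
    )


# The same customer, seen at two different moments.
profile_feb = profile_as_of(events, "c1", datetime(2013, 2, 1))
profile_apr = profile_as_of(events, "c1", datetime(2013, 4, 1))
```

Because nothing is pre-aggregated, adding a new variable means writing a new aggregation, not reloading data.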
Build data around focal points
Think of a focal point as a moment where you want to use what you know before to understand and predict what happens later

What does it mean to build data around a focal point?
Three focal points

A date, e.g. March 10

An event, e.g. a purchase
Why is this important?
The short answer: building data around focal points yields better models and predictions

The longer answer:
Focal points are tools to understand cause and effect
The focal point allows you to reconstruct customer profiles as they were in the past
Viewed through focal points, insights about customer behavior can become blindingly obvious
The plot shows web click rates for a prompt to learn more about a product.

Each dot represents the average response rate for all customers at their first, second, third... twentieth prompt.

The response rate drops with each prompt -- don't over-prompt!

You would never notice this drop by looking at aggregated customer data, or without aligning the data on focal points (prompts)
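The alignment behind such a plot can be sketched as follows, assuming a time-ordered log of (customer_id, clicked) prompt records. The function name and the sample log are invented for illustration and are not the data behind the plot.

```python
from collections import defaultdict

# Hypothetical prompt log in time order: (customer_id, clicked).
prompts = [
    ("c1", True), ("c2", True), ("c1", False), ("c3", False),
    ("c2", False), ("c1", False), ("c3", True),
]


def response_rate_by_prompt_number(prompts):
    """Average click rate at each customer's first, second, third... prompt."""
    seen = defaultdict(int)     # prompts shown so far, per customer
    shown = defaultdict(int)    # prompts shown, per prompt number
    clicked = defaultdict(int)  # prompts clicked, per prompt number
    for customer_id, did_click in prompts:
        seen[customer_id] += 1
        prompt_number = seen[customer_id]
        shown[prompt_number] += 1
        clicked[prompt_number] += int(did_click)
    return {n: clicked[n] / shown[n] for n in sorted(shown)}


rates = response_rate_by_prompt_number(prompts)
```

The key step is numbering prompts per customer, so that every customer's first prompt is compared with every other customer's first prompt regardless of calendar date.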
How can I use focal points in R?
Two options:
1) Write a lot of preprocessing code
2) Causata has a SQL interface with extensions that control focal points
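For a sense of what the preprocessing in option 1 involves, here is a sketch that builds one training row around an event focal point, the customer's first purchase. It is written in Python for illustration only; the function training_row, the event names, and the sample data are assumptions, and this is not Causata's SQL interface.

```python
from datetime import datetime


def training_row(customer_events):
    """Build one row from (timestamp, event_type) pairs sorted by time.

    Predictors use only events before the first purchase (the focal point);
    the outcome looks only at what happened afterwards.
    """
    purchase_times = [t for t, kind in customer_events if kind == "purchase"]
    if not purchase_times:
        return None  # this customer has no focal point
    focal_point = purchase_times[0]
    before = [kind for t, kind in customer_events if t < focal_point]
    return {
        "focal_point": focal_point,
        "page_views_before": before.count("page_view"),
        "support_calls_before": before.count("support_call"),
        "purchased_again": len(purchase_times) > 1,
    }


# Hypothetical customer history.
history = [
    (datetime(2013, 1, 3), "page_view"),
    (datetime(2013, 1, 9), "page_view"),
    (datetime(2013, 1, 10), "purchase"),
    (datetime(2013, 2, 1), "support_call"),
    (datetime(2013, 3, 15), "purchase"),
]
row = training_row(history)
```

Note that the support call after the first purchase is excluded from the predictors; letting it in would leak information from after the focal point.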
Move algorithms, not data
Data storage is getting cheaper by the day, but network speeds aren't improving as quickly
The fastest solution is to execute the scoring algorithms near the data
Beware of using different systems for generating model training data and scoring data -- reconciling two systems can be a painful exercise
PMML is a standard for specifying and migrating models -- it's generated by the Rattle R package and Causata
Real-time interactions require real-time data and scoring
Devil's advocate: "Sure, data changes quickly, but the relevant data is often slow-moving. Data and predictions with x {hours,days,weeks} latency will be fine."
My experience in CXM is that latency matters. A lot.

The gains chart at the right compares scores predicting customer interest in a loan product built from quarterly financial data (MCIF) with scores from Causata models built from real-time data.

The Causata scores have much, much higher lift.
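For readers unfamiliar with lift, a small sketch of how it can be computed from scores and outcomes. The function lift_at and both score lists are invented for illustration; they are not the data behind the chart.

```python
def lift_at(scores, outcomes, fraction):
    """Response rate among the top-scoring fraction divided by the overall response rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:top_n]) / top_n
    overall_rate = sum(outcomes) / len(outcomes)
    return top_rate / overall_rate


# Hypothetical outcomes (1 = customer showed interest) and two sets of model scores.
outcomes = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
stale_scores = [0.85, 0.6, 0.4, 0.7, 0.3, 0.2, 0.8, 0.1, 0.9, 0.35]
fresh_scores = [0.9, 0.2, 0.8, 0.3, 0.1, 0.4, 0.25, 0.7, 0.15, 0.05]

stale_lift = lift_at(stale_scores, outcomes, 0.3)
fresh_lift = lift_at(fresh_scores, outcomes, 0.3)
```

A lift of 1 means the top-scored customers respond no better than average; a gains chart plots the same comparison across all fractions.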
Build variables over multiple timescales
Variables built with events over longer periods (last year) will often complement variables built with subsets of the events over shorter periods (last week)
A count of page views over the last year may indicate long-term interest
A surge in page views in the last week may indicate that a decision is imminent
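A minimal sketch of building the same variable over two timescales, assuming page-view timestamps for one customer and a date focal point. The function count_in_window and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def count_in_window(timestamps, focal_point, days):
    """Count events in the `days` days leading up to (not including) the focal point."""
    start = focal_point - timedelta(days=days)
    return sum(1 for t in timestamps if start <= t < focal_point)


focal_point = datetime(2013, 5, 14)
page_views = [
    datetime(2012, 9, 1),
    datetime(2013, 1, 15),
    datetime(2013, 5, 9),
    datetime(2013, 5, 11),
    datetime(2013, 5, 13),
]

views_last_year = count_in_window(page_views, focal_point, 365)
views_last_week = count_in_window(page_views, focal_point, 7)

# Share of the year's activity that happened in the final week; a high value suggests a surge.
recent_share = views_last_week / views_last_year if views_last_year else 0.0
```

Both counts come from the same events, so adding another timescale is only another call with a different window length.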
Some customers make multiple purchases, and others disappear after their first. Why?
Focal point: first purchase. Use events up to the first purchase to predict if there will be another purchase.

Some customers start using a new product / feature. Why?
Focal point: 1 week ago. Use events up to 1 week ago to predict who will start using the new product / feature over the next week.
A blindingly obvious observation
© 2013 Causata Inc. All Rights Reserved