Transcript of Netflix Project
Data contains: date of rating, rating, user ID, movie ID.
Ratings from more than 500,000 users
More than 17,000 titles available to rate
Data consists of more than 100 million rows
Profitable and unprofitable customers:
Metrics: Revenue per view, average days between activity, average user rating.
Active customers are less profitable than passive customers.
Create a recommendation engine using a subset of Netflix’s database
Try to find a correlation between customer satisfaction and profitability
Large data set: R cannot hold a 9 GB file in memory.
Original data is spread over 17k+ text files, which makes data migration a hassle.
Cannot re-cluster users every time a new user is added. A reference frame is needed.
R and Large Data:
Locally hosted MySQL server.
Use the RMySQL package to communicate with the database. Surprisingly easy to use.
Training and Results
Fine tuning the algorithm by randomly sampling from users and titles
Experimented with the parameters to obtain the best result. Best results were obtained with k = 5 and titles in common = 40 (MAE = 16%).
Error measured by:
Happy or Unhappy Customers?
Unprofitable customers seem to be slightly happier.
Are Happy Customers Less Profitable?
Segment customers by average rating...
Customers: Happy vs Unhappy
File format: the first line is the movie ID; each following line is UserID,Rating,Date.
Developed a script in R to read each file sequentially and prepare an SQL statement that writes a data frame into the database.
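The loading step can be sketched as follows. This is a minimal illustration in Python rather than the R script the project used, and it makes several assumptions: the header line is the movie ID, optionally followed by a colon; an in-memory SQLite database stands in for the MySQL server; the table name `ratings`, its column names, and the sample rows are hypothetical.

```python
import sqlite3
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, int, int, str]  # (movie_id, user_id, rating, date)


def parse_movie_file(lines: Iterable[str]) -> Iterator[Row]:
    """Yield one row per rating line of a single movie file."""
    it = iter(lines)
    # Assumption: the header is the movie ID, possibly with a trailing colon.
    movie_id = int(next(it).strip().rstrip(":"))
    for line in it:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, rating, date = line.split(",")
        yield movie_id, int(user_id), int(rating), date


def load_into_db(conn: sqlite3.Connection, lines: Iterable[str]) -> int:
    """Insert all ratings from one file; return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings "
        "(movie_id INTEGER, user_id INTEGER, rating INTEGER, rated_on TEXT)"
    )
    rows = list(parse_movie_file(lines))
    # Parameterized insert: one statement, many rows.
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up sample data in the assumed file layout.
sample = ["1:", "1001,3,2005-09-06", "1002,5,2005-05-13"]
conn = sqlite3.connect(":memory:")
inserted = load_into_db(conn, sample)
```

Processing one file at a time keeps memory use bounded by the largest single file, which is the point of moving the data into a database.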
Create an imaginary user whose rating for each movie is that movie's most probable rating. New and existing users can now be compared to this user and clustered accordingly.
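One possible reading of this reference user, sketched in Python (not the project's R code). It assumes "most probable rating" means the most frequent rating per movie, and it uses the mean absolute difference over shared movies as the comparison; both choices, and all names below, are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Optional

Ratings = Dict[int, int]  # movie_id -> rating


def reference_user(all_ratings: Dict[int, Ratings]) -> Ratings:
    """Build the imaginary user: the most frequent rating for each movie."""
    per_movie: Dict[int, Counter] = {}
    for user_ratings in all_ratings.values():
        for movie_id, rating in user_ratings.items():
            per_movie.setdefault(movie_id, Counter())[rating] += 1
    # Assumption: ties are broken toward the lower rating.
    return {
        movie_id: min(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for movie_id, counts in per_movie.items()
    }


def distance_to_reference(user: Ratings, reference: Ratings) -> Optional[float]:
    """Mean absolute rating difference over movies the user shares with the reference."""
    common = [m for m in user if m in reference]
    if not common:
        return None  # nothing to compare
    return sum(abs(user[m] - reference[m]) for m in common) / len(common)


existing = {
    1: {10: 4, 20: 5},
    2: {10: 4, 20: 3},
    3: {10: 2, 20: 3},
}
ref = reference_user(existing)
new_user_distance = distance_to_reference({10: 5, 20: 3}, ref)
```

Because every user is measured against the same fixed reference, a new user can be placed without re-clustering everyone else.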
To check a recommendation, compare it with the real rating. We first tried ClusterSize values from 10 to 150 and numbers of movies from 10 to 90.
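The parameter sweep described here can be sketched as a small grid search. This is a Python illustration, not the project's R code. The `predict` callable, the held-out triples, and the step sizes are hypothetical, and the error is the plain mean absolute error in rating points; how the deck converts it to a percentage is not stated.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Predictor = Callable[[str, int, int, int], Optional[float]]


def mean_absolute_error(pairs: List[Tuple[float, float]]) -> float:
    """Average of |predicted - actual| over all pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


def grid_search(
    predict: Predictor,
    held_out: List[Tuple[str, int, float]],  # (customer_id, movie_id, actual rating)
    cluster_sizes: Iterable[int],
    movie_counts: Iterable[int],
) -> Dict[Tuple[int, int], float]:
    """Return the MAE for each (cluster size, number of movies) combination."""
    movie_counts = list(movie_counts)
    results: Dict[Tuple[int, int], float] = {}
    for k in cluster_sizes:
        for n in movie_counts:
            pairs = []
            for customer_id, movie_id, actual in held_out:
                predicted = predict(customer_id, movie_id, k, n)
                if predicted is not None:  # None: no comparable customers
                    pairs.append((predicted, actual))
            if pairs:
                results[(k, n)] = mean_absolute_error(pairs)
    return results


# Toy predictor: always guesses 3.0, and finds no neighbours when n > 40.
def toy_predict(customer_id: str, movie_id: int, k: int, n: int) -> Optional[float]:
    return 3.0 if n <= 40 else None


held_out = [("u1", 1, 5.0), ("u2", 2, 2.0)]
results = grid_search(toy_predict, held_out, range(10, 151, 70), range(10, 91, 40))
```

Combinations for which no prediction can be made are left out of the result, which mirrors the case where too few comparable customers exist.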
Segment Customers by Profitability Profile
Select customers who have at least a 1 year subscription
Profitable customers: revenue per view (Rpv) > $4.79
Unprofitable customers: Rpv < $1
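These thresholds amount to a simple rule, sketched here in Python. The two cut-offs come from the slide; treating "1 year" as 365 days, the label "middle" for customers between the cut-offs, and the function name are assumptions made for illustration.

```python
from typing import Optional

PROFITABLE_MIN_RPV = 4.79    # from the slide: Rpv > $4.79
UNPROFITABLE_MAX_RPV = 1.00  # from the slide: Rpv < $1
MIN_SUBSCRIPTION_DAYS = 365  # assumption: "1 year" taken as 365 days


def profitability_segment(rpv: float, subscription_days: int) -> Optional[str]:
    """Classify a customer by revenue per view; None if subscribed under a year."""
    if subscription_days < MIN_SUBSCRIPTION_DAYS:
        return None  # excluded from the analysis
    if rpv > PROFITABLE_MIN_RPV:
        return "profitable"
    if rpv < UNPROFITABLE_MAX_RPV:
        return "unprofitable"
    return "middle"  # hypothetical label for everyone in between
```

For example, `profitability_segment(5.20, 400)` gives `"profitable"`, while a customer with the same Rpv but only 100 days of subscription is excluded.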
Recommendation function (customer ID, movie ID, cluster size, number of movies)
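A function with this signature could look like the nearest-neighbour sketch below. It is written in Python, not the project's R, and it rests on assumptions: "cluster size" is read as the number of nearest customers (k), "number of movies" as the minimum number of titles two customers must share, similarity is the mean absolute rating difference, and the prediction is the neighbours' average rating. The defaults of 5 and 40 are the best values reported in the deck.

```python
from typing import Dict, Optional

Ratings = Dict[int, int]  # movie_id -> rating


def recommend(
    all_ratings: Dict[str, Ratings],
    customer_id: str,
    movie_id: int,
    cluster_size: int = 5,
    min_common: int = 40,
) -> Optional[float]:
    """Predict a rating from the most similar customers who rated the movie."""
    target = all_ratings[customer_id]
    candidates = []
    for other_id, other in all_ratings.items():
        if other_id == customer_id or movie_id not in other:
            continue
        # Titles both customers rated, excluding the one being predicted.
        common = [m for m in target if m in other and m != movie_id]
        if len(common) < min_common:
            continue
        distance = sum(abs(target[m] - other[m]) for m in common) / len(common)
        candidates.append((distance, other_id))
    if not candidates:
        return None  # not enough comparable customers
    candidates.sort()
    nearest = candidates[:cluster_size]
    return sum(all_ratings[o][movie_id] for _, o in nearest) / len(nearest)


toy = {
    "a": {1: 5, 2: 4, 3: 1},
    "b": {1: 5, 2: 4, 3: 1, 9: 5},
    "c": {1: 1, 2: 2, 3: 5, 9: 1},
    "d": {1: 5, 9: 3},
}
closest_only = recommend(toy, "a", 9, cluster_size=1, min_common=2)
two_nearest = recommend(toy, "a", 9, cluster_size=2, min_common=2)
```

Raising `min_common` shrinks the pool of comparable customers, so the function returns `None` once no one shares enough titles with the target.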
The results show that with large ranges, not only does the error rate start to increase, but there are also not enough customers to compare with.
Actual rating: 5
Actual rating: 2