Short version of MAA talk: Start your own Netflix
1) "Doing Data Science" with Schutt, Gattis and Crawshaw
2) "Google News Personalization: Scalable Online Collaborative Filtering" by Das, Datar, Garg, and Rajaram
3 Recommendation Engines
Latent topic analysis
(might want to label edges)
Can think of them as items
How do we anticipate preferences?
It's a start. Fix an item and find user i's rating of that item.
Good news / bad news
Good: closed-form solution
Bad: no sharing of information between items
Bad: too many items and missing items
Bad: this causes huge coefficients
But: you can add a prior / penalty
But: then you'd be introducing other problems
Reducing dimensions: SVD decomposition
U is a matrix whose rows correspond to users
V is a matrix whose rows correspond to items
S diagonal, entries <-> importance
k is rank but we can cut it down to approximate X
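The SVD picture above can be sketched in a few lines of numpy; the matrix X here is a made-up toy user-by-item matrix, and k = 5 is an arbitrary cutoff:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 40))              # toy user-by-item matrix (assumed data)

# full SVD: rows of U <-> users, rows of V <-> items, s holds the
# diagonal of S, in order of importance
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# cut the rank down to k to approximate X
k = 5
X_k = U[:, :k] * s[:k] @ Vt[:k]
```

Keeping only the top k singular values gives the best rank-k approximation of X in squared error, which is the sense in which the small matrix "approximates" the big one.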
Start out dumber. Find U, V so that X ≈ UV^T.
Q: What are we minimizing?
A: Squared error (in the 1st iteration): ||X - UV^T||^2
Alternating Least Squares
User by user (or item by item)
Comes down to linear regression
(keep doing this until nothing much changes)
rows of U <-> users
rows of V <-> items
columns <-> latent variables
There are d columns
You choose d
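The alternation above can be sketched with dense toy data; a real recommender would restrict each regression to a user's known scores, and lam is the ridge penalty from the "prior / penalty" slide (all names and sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 20, 15, 3       # you choose d
X = rng.random((n_users, n_items))    # toy dense ratings (assumed data)
lam = 0.1                             # ridge penalty to tame huge coefficients

U = rng.random((n_users, d))
V = rng.random((n_items, d))

for _ in range(30):                   # keep going until nothing much changes
    # fix V: each user's row of U comes down to a small ridge regression
    U = np.linalg.solve(V.T @ V + lam * np.eye(d), V.T @ X.T).T
    # fix U: each item's row of V, likewise
    V = np.linalg.solve(U.T @ U + lam * np.eye(d), U.T @ X).T

err = np.linalg.norm(X - U @ V.T)
```

Each solve inverts only a d-by-d matrix, which is why the per-user (or per-item) step stays cheap no matter how large X gets.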
Simple problems get hard
with big data
Example: word frequency
How do you solve this?
By inspection, if there are only a few words
Or, if you have pages and pages, write an algorithm
Make sure you store only one copy of each word
Scales to ~100 million words
Obstacle is your computer's memory
Can we squeeze out more?
Compress your words
Many cores - use them all
Can maybe do 10 trillion words
Q: What about if we have
more than 10 trillion words?
A: Use more computers!
If we have 10 computers, we can process 100 trillion words.
We might actually split up the work into parts and then send the results to a "controller" to add it all up.
As you grow your grid,
the risk of a failure grows.
Say a computer fails with probability ε.
Then ε is close to 0.
But it's not equal to 0.
What is the chance that at least one of 1,000 computers fails?
Computers sometimes fail.
It depends on the time frame, but if ε = 0.3%,
then the chance of at least one failure is 1 - (1 - ε)^1000, which is more than 95%.
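A quick sanity check of that arithmetic:

```python
eps = 0.003                          # per-machine failure probability (0.3%)
n = 1000                             # machines in the grid
p_any_failure = 1 - (1 - eps) ** n
# p_any_failure ≈ 0.9504, i.e. more than 95%
```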
We need to deal with robustness issues.
Hopefully separately from analytical issues.
MapReduce takes the robustness problems off your hands.
Copies data, deals with piping issues
Provided you can put your problem in its framework
Namely, a (series of) map step, then a reduce step.
The mapper performs some function;
output is of the form (key, value).
The reducer aggregates by key;
output is of the form (key, aggregated value).
Example: word frequency
"red" → ("red", 1)
(aggregates the "red"'s)
count the ("red", 1)'s
i.e. add up the 1's
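The word-frequency example above can be sketched in a few lines of Python; the function names and word list are illustrative, and the grouping dict stands in for the fan-in:

```python
from collections import defaultdict

def mapper(word):
    # map step: emit a (key, value) pair for each word
    return (word, 1)

def reducer(key, values):
    # reduce step: aggregate all the values that share a key,
    # i.e. add up the 1's
    return (key, sum(values))

words = ["red", "blue", "red", "red", "blue"]
grouped = defaultdict(list)
for key, value in map(mapper, words):    # map
    grouped[key].append(value)           # fan-in: group by key
counts = dict(reducer(k, v) for k, v in grouped.items())  # reduce
# counts == {"red": 3, "blue": 2}
```

In a real MapReduce run the mapper and reducer execute on different machines and the framework does the grouping, which is exactly the robustness it takes off your hands.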
Can we bring these
two ideas together?
Can we MapReduce our various Recommendation Engines?
First fix a user, i.e. a length-d row of U
Train a linear model with all of that user's known scores
There are not too many known scores per user
Invert a d-by-d matrix, which is small
Similarly, fix a movie.
There may be quite a few scores.
Covisitation is parallelizable
Factor model is parallelizable
Certain things are associated with sexiness
Sexiness is really a combination of things
Not totally prescriptive
But we can approximate it pretty well
Is sexiness important?
Broaden our goal
Looking for X:
Picture of SVD: before
Picture of SVD: after
Send relevant data for U to mapper - not too large!
Have the mapper perform task and
Mapper output = (U, answer)
Fan-in does nothing, reducer does nothing
Just need each algo to be smallish and parallelizable
Start your own Netflix
Johnson Research Labs
Factor analysis: You like sentimental films, this film is sentimental.
Covisitation: People who like what you like also like this.
Topic analysis: This is related to stuff you like.
What is a recommendation engine?
Amazon, Netflix, Pandora, Spotify, Google News, many others.
You need LOTS of data on people and their preferences to make these work well.
With all this data we build a huge matrix X that encapsulates all the users and their preferences and attributes.
Rows are users, columns are preferences or attributes.
How do we make these computationally feasible?
1's and 0's (clicks)
User identified with history vector
Similarity of users defined by ratio of intersection to union: S(u, u') = |u ∩ u'| / |u ∪ u'|
Fixing u, find S(u, u') for all other u'.
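That ratio is the Jaccard similarity; a minimal version, with made-up story IDs:

```python
def jaccard(history_u, history_v):
    # ratio of the intersection to the union of two click histories
    u, v = set(history_u), set(history_v)
    return len(u & v) / len(u | v)

similarity = jaccard({"s1", "s2", "s3"}, {"s2", "s3", "s4"})
# 2 stories in common out of 4 total → 0.5
```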
Too hard! Instead, hash.
Map users to unsigned 64-bit numbers
"Similar users" more likely to collide
Thus create clusters of users.
User u reads stories S1, S2, ...
Randomize, then order story ID's
First item in common with u'?
How this works
Likelihood of collision is the Jaccard similarity S(u, u').
Basic idea behind each engine
How this really works
Use minhash to order ID's
Use more than one story
Lots of clusters per user
Remove small clusters
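The scheme above can be sketched as follows; the seeded md5 hash stands in for "randomize, then order story IDs", and the function names and seed count are illustrative:

```python
import hashlib

def minhash(history, seed):
    # hash every story ID under a seeded hash and keep the story with
    # the smallest hash: this is "the first item after randomizing"
    def h(story_id):
        return hashlib.md5(f"{seed}:{story_id}".encode()).hexdigest()
    return min(history, key=h)

def cluster_key(history, seeds=(0, 1, 2)):
    # use more than one minhash: concatenating several makes a
    # collision stricter, so clusters are smaller and tighter
    return tuple(minhash(history, s) for s in seeds)
```

Two users land in the same cluster exactly when all their minhashes collide, and each single collision happens with probability equal to their Jaccard similarity.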
Latent Topic Modeling
Think about specifying a list of topics
Is this article likely to relate to this topic?
Is this person likely to enjoy this topic?
Is this the right topic?
The model relates:
p(topic | user),
p(article | topic), and of course
p(topic | users, articles)
Maximize product of conditional likelihood over all data points
Equivalently, minimize the negative log-likelihood
Alternate between optimizing
topics, other prob's
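A minimal single-machine sketch of that alternation (an E-M loop over p(topic | user) and p(article | topic)); the click counts and all sizes are made-up toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_articles, n_topics = 4, 6, 2
# toy click counts (assumed data): rows = users, columns = articles
X = rng.integers(1, 4, size=(n_users, n_articles)).astype(float)

# random init of p(topic | user) and p(article | topic)
p_z_u = rng.random((n_users, n_topics))
p_z_u /= p_z_u.sum(axis=1, keepdims=True)
p_a_z = rng.random((n_topics, n_articles))
p_a_z /= p_a_z.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: p(topic | user, article), up to normalization
    joint = p_z_u[:, :, None] * p_a_z[None, :, :]   # (user, topic, article)
    resp = joint / joint.sum(axis=1, keepdims=True)
    # M-step: re-estimate both tables from expected topic counts
    counts = resp * X[:, None, :]
    p_z_u = counts.sum(axis=2)
    p_z_u /= p_z_u.sum(axis=1, keepdims=True)
    p_a_z = counts.sum(axis=0)
    p_a_z /= p_a_z.sum(axis=1, keepdims=True)
```

Each pass holds one set of probabilities fixed while re-estimating the other, which is what makes the iteration MapReducable: the expected counts are exactly the "sufficient statistics" to ship between machines.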
Mapper takes user history, makes cluster ID's
output: (userID, clusterID)'s
Fan-in collects users in same cluster
Reducer throws away small clusters, populates
a "big table" to help decide recommendations.
Latent Topic modeling is MapReducable
Choose two integers R and K; assign user i to group i mod R and article j to group j mod K.
"Sufficient statistics" are relevant probabilities, so choose R and K large enough.
MapReduce to iterate on E-M algorithm
Clicks might not be up-votes
Some people are over-represented
People may lie
Missing information is a problem
New people are a problem
Seasonal data is a problem