### Present Remotely

Send the link below via email or IM

• Invited audience members will follow you as you navigate and present
• People invited to a presentation do not need a Prezi account
• This link expires 10 minutes after you close the presentation

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

You can change this under Settings & Account at any time.

# MAA talk: Start your own Netflix

For the Financial Engineering's Practitioner's seminar on April 15, 2013
by

## CATHERINE ONEIL

on 22 January 2014

Report abuse

#### Transcript of MAA talk: Start your own Netflix

References:

1) "Doing Data Science" with Schutt, Gattis and Crawshaw

2) "Google News Personalization: Scalable Online Collaborative Filtering" by Das, Datar, Garg, and Rajaram

Part 1:

3 Recommendation Engines

Factor analysis
Co-visitation
Latent topic analysis

(might want to label edges)
gender
age
is_mac
Can think of them as items
How do we anticipate preferences?
Linear regression?

It's a start. Fix an item and find user i's rating of that item.
Good: closed-form solution
Bad: no sharing of information between items
Bad: too many items and missing items
But: you can add a prior/ penalty
But: then you'd be introducing other problems
Reducing dimensions: SVD decomposition
Interpreting SVD
U is a matrix whose rows correspond to users
V is a matrix whose rows correspond to items
S diagonal, entries <-> importance
k is rank but we can cut it down to approximate X
Computationally expensive
PCA
Start out dumber. Find U, V so that:
PCA
Q: What are minimizing?
A: Squared error (in 1st iteration):
Alternating Least Squares
Optimize how?
User by user (or item by item)
Comes down to linear regression
(keep doing this until nothing much changes)
rows of U <-> users
rows of V <-> items
columns <-> latent variables
There are d columns
You choose d
Simple problems get hard
with big data
Example: word frequency
red
bird
blue
green
red
green
How do you solve this?
By inspection, if there are only a few words
Or, if you have pages and pages, write an algorithm
Make sure you store 1 copy of each word (channel)
Scales to ~100 million words
Can we squeeze out more?
Many cores - use them all
Can maybe do 10 trillion words
Q: What about if we have
more than 10 trillion words?

A: Use more computers!
If we have 10 computers, we can process 100 trillion words.
We might actually split up the work into parts and then send the results to a "controller" to add it all up.
What next?
you get increasingly risky.
Say a computer fails with probability .
Then is close to 0.
But it's not equal to 0.
What is the chance that one of 1,000 computers fails?
Computers sometimes fail.
Example
It depends on the time frame but if = 0.3%,
then chances of something failing is more than 95%.
HELP!!!
We need to deal with robustness issues.
Hopefully separately from analytical issues.
Enter: MapReduce
MapReduce takes the robustness problems off your hands.
Copies data, deals with piping issues
Provided you can put your problem in its framework
Namely, a (series of) map step, then a reduce step.
Mapper
Takes data,
performs some function,
output is of form
(key, value)
Reducer
Aggregates by key.
performs function,
output is of form
(key, newvalue)
Example: word frequency

"red" ("red", 1)

(aggregates the "red"'s)

count the ("red", 1)'s
Can we bring these
two ideas together?
Can we MapReduce our various Recommendation Engines?
First fix a user, so a length d column in U
Train a linear model with all known scores for U
There are not too many known scores per user
Invert a d-by-d matrix, which is small
Similarly, fix a movie.
There may be quite a few scores.
Covisitation is parallelizable
1. Factor model is parallelizable
Thank you
Latent Variables?
Example: sexiness
Certain things are associated with sexiness
Sexiness is really a combination of things
Not totally prescriptive
But we can approximate it pretty well
Is sexiness important?
mapper
reducer
fan-in
Part 3:
Part 2:
U=Users
V=Items
Looking for X:
Picture of SVD: before
Picture of SVD: after
Stupid MapReduce
Send relevant data for U to mapper - not too large!
Have the mapper perform task and
Fan-in does nothing, reducer does nothing
Just need each algo to be smallish and parallelizable
Cathy O'Neil
Johnson Research Labs
mathbabe.org

Factor analysis: You like sentimental films, this film is sentimental.

Covisitation: People who like what you like also like this.

Topic analysis: This is related to stuff you like.

What is a recommendation engine?

Amazon, Netflix, Pandora, Spotify, Google News, many others.

You need LOTS of data on people and their preferences to make these work well.
X=
With all this data we build a huge matrix X that encapsulates all the users and their preferences and attributes.

Rows are users, columns are preferences or attributes.
How do we make these computationally feasible?
Covisitation

1's and 0's (clicks)
User identified with history vector
Similarity of users defined by ratio of intersection and union:
Fixing u, find S(u, u') for all other u'.

Map users to unsigned 64-bit numbers

"Similar users" more likely to collide

Thus create clusters of users.
User u reads stories S1, S2, ...
Randomize, then order story ID's
First item in common with u'?
How this works

Likelihood of collision is
S(u, u')
Basic idea behind each engine
How this really works
Use minhash to order ID's
Use more than one story
Lots of clusters per user
Remove small clusters
Latent Topic Modeling

Think about specifying a list of topics

Is this person likely to enjoy this topic?
Is this the right topic?
Probabilistic Model

The model relates:
p(topic | user),
p(article | topic), and of course
p(topic | users, articles)

Iteratively computed
E-M algorithm

Maximize product of conditional likelihood over all data points
=
Minimize "log likelihood"

Alternate between optimizing
topics, other prob's
Mapper takes user history, makes cluster ID's
output: (userID, clusterID)'s

Fan-in collects users in same cluster

Reducer throws away small clusters, populates
a "big table" to help decide recommendations.
Latent Topic modeling is MapReducable
Choose two integers R and K and for i and j, choose user ID's i mod R and j mod K.

"Sufficient statistics" are relevant probabilities, so choose R and K large enough.

MapReduce to iterate on E-M algorithm
Caveats