### Present Remotely

Send the link below via email or IM

• Invited audience members will follow you as you navigate and present
• People invited to a presentation do not need a Prezi account
• This link expires 10 minutes after you close the presentation

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

For the Financial Engineering's Practitioner's seminar on April 15, 2013
by

## CATHERINE ONEIL

on 12 June 2013

Report abuse

#### Transcript of Start your own Netflix

Taken from Doing Data Science
with Rachel Schutt (JRL)
Contributed by Matt Gattis
and David Crawshaw
Part 1:

Matt Gattis

and

Hunch.com
Hunch.com
Recently acquired by eBay
Asking only 20 questions allows the engine to predict the rest with 80% accuracy
How does it do this?
(might want to label edges)
gender
age
is_mac
Can think of them as items
How do we anticipate preferences?
k-NN?

Too many dimensions - most edges don't exist
How do we anticipate preferences?
Linear regression?

It's a start. Fix an item.
Closed form solution
No sharing of information between items
Too many items
Missing items
This causes huge coefficients
You can add a prior/ penalty
But then you'd be introducing other problems
Reducing dimensions: SVD decomposition
Proof by notation
U is a matrix whose rows correspond to users
V is a matrix whose rows correspond to items
S diagonal, entries <-> importance
k is rank but we can cut it down to approximate X
Computionally expensive
PCA
Start out dumber. Find U, V so that:
PCA
Q: What are minimizing?
A: Squared error (in 1st iteration):
Alternating Least Squares
Optimize how?
User by user (or item by item)
Comes down to linear regression
(keep doing this until nothing much changes)
Does this always work?
Might have to increase the prior
I.e. increase penalty for large coefficients
No longer getting closest fit!
Not unique either
Especially if there are missing values
"Forming a data team is kind of like planning a Heist"
engine questions: "What kind of cell phone should I buy?"
rows of U <-> users
rows of V <-> items
columns <-> latent variables
d columns, you choose d

David Crawshaw
MapReduce
math nerd
Computational building block
Separates algorithms from robustness errors
Simple problems get hard
with big data
Example: word frequency
red
bird
blue
green
red
green
How do you solve this?
By inspection, if there are only a few words
Or, if you have pages and pages, write an algorithm
Make sure you store 1 copy of each word (channel)
Scales to ~100 million words
Can we squeeze out more?
Many cores - use them all
Can maybe do 10 trillion words
Q: What about if we have
more than 10 trillion words?

A: Use more computers!
If we have 10 computers, we can process 100 trillion words.
We might actually split up the work into parts and then send the results to a "controller" to add it all up.
What next?
you get increasingly fucked.
Say a computer fails with probability .
Then is close to 0.
But it's not equal to 0.
What is the chance that one of 1,000 computers fails?
Computers sometimes fail.
Example
It depends on the time frame but if = 0.3%,
then chances of something failing is more than 95%.
HELP!!!
We need to deal with robustness issues.
Hopefully separately from analytical issues.
Enter: MapReduce
MapReduce takes the robustness problems off your hands.
Copies data, deals with piping issues
Provided you can put your problem in its framework
Namely, a (series of) map step, then a reduce step.
Mapper
Takes data,
performs some function,
output is of form
(key, value)
Reducer
Aggregates by key.
performs function,
output is of form
(key, newvalue)
Example: word frequency

"red" ("red", 1)

(aggregates the "red"'s)

count the ("red", 1)'s
Can we bring these
two ideas together?
Can we MapReduce our Recommendation Engine?
Data is of the form: (user, item, answer)
Two matrices U and V also in memory
We fix a user, so a length d column in U
Then we train a linear model for all known answers
There are not too many known answers per user
Invert a d-by-d matrix, which is small
ALS is also MapReducable
ALS is parallelizable
There might even be a better way...
Conclusion
Data Scientists need to know Stats and CS
Need to solve large-scale data problems
Usually involving people's private data
Thank you
CTO of Hunch
Latent Variables?
Example: sexiness
Certain things are associated with sexiness
Sexiness is really a combination of things
Not totally prescriptive
But we can approximate it pretty well
mapper
reducer
fan-in
Part 3:
Part 2:
Encore: overburdened priors
We have a prior belief
We want an algo to converge
We want to optimize error
Now on Mahout
U=Users
V=Items
Goal
Looking for X:
Picture of SVD: before
Picture of SVD: after
Other issues
Other issues?
People change over time!
People lie!
Some people are over-represented!
Because you could just do this:
Stupid reason
Send user id, triples, and row for given user into mapper
Have the mapper perform linear regression and
Mapper output = (user id, new row)
Fan-in does nothing, reducer does nothing