1) "Doing Data Science" with Schutt, Gattis and Crawshaw

2) "Google News Personalization: Scalable Online Collaborative Filtering" by Das, Datar, Garg, and Rajaram

Part 1:

3 Recommendation Engines

Factor analysis

Co-visitation

Latent topic analysis

(might want to label edges)

**Metadata**

gender

age

is_mac

Can think of them as items

How do we anticipate preferences?

Linear regression?

It's a start. Fix an item and find user i's rating of that item.

Good news/ Bad news

Good: closed-form solution

Bad: no sharing of information between items

Bad: too many items and missing items

Bad: this causes huge coefficients

But: you can add a prior/ penalty

But: then you'd be introducing other problems

Reducing dimensions: SVD decomposition

Interpreting SVD

U is a matrix whose rows correspond to users

V is a matrix whose rows correspond to items

S diagonal, entries <-> importance

k is rank but we can cut it down to approximate X

Computationally expensive

**PCA**

Start out dumber. Find U, V so that:

**PCA**

Q: What are minimizing?

A: Squared error (in 1st iteration):

Alternating Least Squares

**Optimize how?**

User by user (or item by item)

Comes down to linear regression

(keep doing this until nothing much changes)

rows of U <-> users

rows of V <-> items

columns <-> latent variables

There are d columns

You choose d

Simple problems get hard

with big data

Example: word frequency

red

bird

blue

green

red

green

How do you solve this?

By inspection, if there are only a few words

Or, if you have pages and pages, write an algorithm

Make sure you store 1 copy of each word (channel)

Scales to ~100 million words

Obstacle is your computer's memory

Can we squeeze out more?

Compress your words

Many cores - use them all

Can maybe do 10 trillion words

Q: What about if we have

more than 10 trillion words?

A: Use more computers!

If we have 10 computers, we can process 100 trillion words.

We might actually split up the work into parts and then send the results to a "controller" to add it all up.

What next?

As you grow your grid,

you get increasingly risky.

Say a computer fails with probability .

Then is close to 0.

But it's not equal to 0.

What is the chance that one of 1,000 computers fails?

Computers sometimes fail.

Example

It depends on the time frame but if = 0.3%,

then chances of something failing is more than 95%.

**HELP!!!**

We need to deal with robustness issues.

Hopefully separately from analytical issues.

Enter: MapReduce

MapReduce takes the robustness problems off your hands.

Copies data, deals with piping issues

Provided you can put your problem in its framework

Namely, a (series of) map step, then a reduce step.

Mapper

Takes data,

performs some function,

output is of form

(key, value)

Reducer

Aggregates by key.

performs function,

output is of form

(key, newvalue)

Example: word frequency

"red" ("red", 1)

(aggregates the "red"'s)

count the ("red", 1)'s

i.e. add up the 1's

Can we bring these

two ideas together?

Can we MapReduce our various Recommendation Engines?

First fix a user, so a length d column in U

Train a linear model with all known scores for U

There are not too many known scores per user

Invert a d-by-d matrix, which is small

Similarly, fix a movie.

There may be quite a few scores.

Covisitation is parallelizable

1. Factor model is parallelizable

**Thank you**

Latent Variables?

Example: sexiness

Certain things are associated with sexiness

Sexiness is really a combination of things

Not totally prescriptive

But we can approximate it pretty well

Is sexiness important?

mapper

reducer

fan-in

Part 3:

**Part 2:**

U=Users

V=Items

Broaden our goal

Looking for X:

Picture of SVD: before

Picture of SVD: after

Stupid MapReduce

Send relevant data for U to mapper - not too large!

Have the mapper perform task and

Mapper output = (U, answer)

Fan-in does nothing, reducer does nothing

Just need each algo to be smallish and parallelizable

**Start your own Netflix**

Cathy O'Neil

Johnson Research Labs

mathbabe.org

Factor analysis: You like sentimental films, this film is sentimental.

Covisitation: People who like what you like also like this.

Topic analysis: This is related to stuff you like.

What is a recommendation engine?

Amazon, Netflix, Pandora, Spotify, Google News, many others.

You need LOTS of data on people and their preferences to make these work well.

X=

With all this data we build a huge matrix X that encapsulates all the users and their preferences and attributes.

Rows are users, columns are preferences or attributes.

How do we make these computationally feasible?

Covisitation

1's and 0's (clicks)

User identified with history vector

Similarity of users defined by ratio of intersection and union:

Fixing u, find S(u, u') for all other u'.

Too hard! Instead, hash.

Map users to unsigned 64-bit numbers

"Similar users" more likely to collide

Thus create clusters of users.

User u reads stories S1, S2, ...

Randomize, then order story ID's

First item in common with u'?

How this works

Likelihood of collision is

S(u, u')

Basic idea behind each engine

How this really works

Use minhash to order ID's

Use more than one story

Lots of clusters per user

Remove small clusters

Latent Topic Modeling

Think about specifying a list of topics

Is this article likely to relate to this topic?

Is this person likely to enjoy this topic?

Is this the right topic?

Probabilistic Model

The model relates:

p(topic | user),

p(article | topic), and of course

p(topic | users, articles)

Iteratively computed

E-M algorithm

Maximize product of conditional likelihood over all data points

=

Minimize "log likelihood"

Alternate between optimizing

topics, other prob's

Mapper takes user history, makes cluster ID's

output: (userID, clusterID)'s

Fan-in collects users in same cluster

Reducer throws away small clusters, populates

a "big table" to help decide recommendations.

Latent Topic modeling is MapReducable

Choose two integers R and K and for i and j, choose user ID's i mod R and j mod K.

"Sufficient statistics" are relevant probabilities, so choose R and K large enough.

MapReduce to iterate on E-M algorithm

Caveats

Clicks might not be up-votes

Some people are over-represented

People may lie

Missing information is a problem

New people are a problem

Seasonal data is a problem