with Rachel Schutt (JRL)

Contributed by Matt Gattis and David Crawshaw

Part 1:

Matt Gattis and Hunch.com

Hunch.com

Recently acquired by eBay

Asking only 20 questions allows the engine to predict the rest with 80% accuracy

How does it do this?

(might want to label edges)

**Metadata**

gender

age

is_mac

Can think of them as items

How do we anticipate preferences?

k-NN?

Too many dimensions - most edges don't exist

How do we anticipate preferences?

Linear regression?

It's a start. Fix an item.

Good news/ Bad news

Closed form solution

No sharing of information between items

Too many items

Missing items

This causes huge coefficients

You can add a prior/ penalty

But then you'd be introducing other problems
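A minimal sketch of the fix-an-item idea with a ridge prior, assuming a user-by-item answer matrix X where np.nan marks missing answers; the function name and the penalty lam are illustrative, not Hunch's actual implementation:

```python
import numpy as np

# Sketch: fix one item and regress its column on all the other items' columns.
# The ridge penalty lam keeps coefficients from blowing up when items are
# missing or collinear (the problem described above).
def predict_item(X, item, lam=1.0):
    y = X[:, item]
    others = np.delete(X, item, axis=1)
    # Train only on rows where the target and all predictors are observed.
    mask = ~np.isnan(y) & ~np.isnan(others).any(axis=1)
    A, b = others[mask], y[mask]
    # Closed-form ridge solution: beta = (A'A + lam I)^-1 A'b
    beta = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
    return others @ beta   # a prediction of this item for every user
```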

Reducing dimensions: SVD (singular value decomposition)

Proof by notation

U is a matrix whose rows correspond to users

V is a matrix whose rows correspond to items

S diagonal, entries <-> importance

k is rank but we can cut it down to approximate X

Computationally expensive
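A small illustration of the rank-k cut on toy data (sizes are made up; real answer matrices are huge and mostly missing, which is part of why this is expensive):

```python
import numpy as np

X = np.random.rand(100, 20)               # toy matrix: 100 users, 20 items
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5                                     # keep only the k most important directions
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(X - X_k))            # error of the rank-k approximation
```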

**PCA**

Start out dumber. Find U, V so that X ≈ U·Vᵀ

**PCA**

Q: What are we minimizing?

A: Squared error (in 1st iteration): the sum over known (user, item) answers of (x_ui − u_user · v_item)²

Alternating Least Squares

**Optimize how?**

User by user (or item by item)

Comes down to linear regression

(keep doing this until nothing much changes)

**Does this always work?**

Might have to increase the prior

I.e. increase penalty for large coefficients

No longer getting closest fit!

Not unique either

Especially if there are missing values
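A minimal alternating-least-squares sketch under the setup above, with a ridge prior on the coefficients; triples, d, and lam are illustrative choices, not production values:

```python
import numpy as np

def als(triples, n_users, n_items, d=10, lam=0.1, n_iters=20):
    """triples: list of (user, item, answer). Returns factor matrices U, V."""
    U = 0.1 * np.random.randn(n_users, d)
    V = 0.1 * np.random.randn(n_items, d)
    by_user, by_item = {}, {}
    for u, i, x in triples:
        by_user.setdefault(u, []).append((i, x))
        by_item.setdefault(i, []).append((u, x))

    def new_row(obs, fixed):
        # Ridge regression of the observed answers on the fixed factors.
        idx = [j for j, _ in obs]
        y = np.array([x for _, x in obs])
        A = fixed[idx]
        return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

    for _ in range(n_iters):               # keep doing this until nothing much changes
        for u, obs in by_user.items():      # fix V, update each user's row of U
            U[u] = new_row(obs, V)
        for i, obs in by_item.items():      # fix U, update each item's row of V
            V[i] = new_row(obs, U)
    return U, V
```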

"Forming a data team is kind of like planning a Heist"

Hunch asks its users questions, then lets the user ask the engine questions: "What kind of cell phone should I buy?"

rows of U <-> users

rows of V <-> items

columns <-> latent variables

d columns, you choose d

David Crawshaw

**MapReduce**

Google engineer

math nerd

Computational building block

Separates the algorithm from robustness problems

Simple problems get hard with big data

Example: word frequency

red

bird

blue

green

red

green

How do you solve this?

By inspection, if there are only a few words

Or, if you have pages and pages, write an algorithm

Make sure you store 1 copy of each word (channel)

Scales to ~100 million words

Obstacle is your computer's memory
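A sketch of that single-machine algorithm: one dictionary entry (word plus counter) per distinct word, so the limit is how many distinct words fit in memory:

```python
def word_frequency(words):
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1   # store one copy of each word with a counter
    return counts

print(word_frequency(["red", "bird", "blue", "green", "red", "green"]))
# {'red': 2, 'bird': 1, 'blue': 1, 'green': 2}
```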

Can we squeeze out more?

Compress your words

Many cores - use them all

Can maybe do 10 trillion words

Q: What if we have more than 10 trillion words?

A: Use more computers!

If we have 10 computers, we can process 100 trillion words.

We might actually split up the work into parts and then send the results to a "controller" to add it all up.

What next?

As you grow your grid, you get increasingly fucked.

Say a computer fails with probability p.

Then p is close to 0.

But it's not equal to 0.

What is the chance that at least one of 1,000 computers fails?

Computers sometimes fail.

Example

It depends on the time frame, but if p = 0.3%, the chance of at least one failure is 1 − (1 − 0.003)^1000, which is more than 95%.
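Checking that arithmetic directly (independent failures assumed):

```python
p, n = 0.003, 1000
print(1 - (1 - p) ** n)   # ≈ 0.95, i.e. a >95% chance that at least one machine fails
```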

**HELP!!!**

We need to deal with robustness issues.

Hopefully separately from analytical issues.

Enter: MapReduce

MapReduce takes the robustness problems off your hands.

Copies data, deals with piping issues

Provided you can put your problem in its framework

Namely, a (series of) map step, then a reduce step.

Mapper

Takes data, performs some function, output is of form (key, value)

Reducer

Aggregates by key, performs some function, output is of form (key, newvalue)

Example: word frequency

"red" → ("red", 1)

(aggregates the "red"'s)

count the ("red", 1)'s

i.e. add up the 1's
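A toy sketch of that shape in plain Python, with the three phases written out by hand (in a real framework the fan-in and the distribution across machines are handled for you):

```python
from itertools import groupby

def mapper(word):
    yield (word, 1)                       # emit (key, value)

def reducer(key, values):
    yield (key, sum(values))              # emit (key, newvalue), i.e. add up the 1's

words = ["red", "bird", "blue", "green", "red", "green"]
mapped = sorted(kv for w in words for kv in mapper(w))          # map
grouped = groupby(mapped, key=lambda kv: kv[0])                 # fan-in: aggregate by key
counts = [out for k, g in grouped for out in reducer(k, (v for _, v in g))]
print(counts)   # [('bird', 1), ('blue', 1), ('green', 2), ('red', 2)]
```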

Can we bring these two ideas together?

Can we MapReduce our Recommendation Engine?

Data is of the form: (user, item, answer)

Two matrices U and V also in memory

We fix a user, so a length-d row of U

Then we train a linear model for all known answers

There are not too many known answers per user

Invert a d-by-d matrix, which is small
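A sketch of what that per-user mapper could look like, assuming V (small: n_items × d) is available to every mapper and each mapper receives one user's (user, item, answer) triples; the names and the ridge penalty are illustrative:

```python
import numpy as np

def user_update_mapper(user_id, user_triples, V, d, lam=0.1):
    # With V fixed, the user's new length-d row of U is a small ridge regression:
    # only a d-by-d matrix ever has to be inverted.
    items = [i for (_, i, _) in user_triples]
    answers = np.array([x for (_, _, x) in user_triples])
    A = V[items]                                # one row of V per item this user answered
    new_row = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ answers)
    yield (user_id, new_row)                    # output of form (key, value); reducer is identity
```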

ALS is also MapReducable

ALS is parallelizable

There might even be a better way...

Conclusion

Data Scientists need to know Stats and CS

Need to solve large-scale data problems

Usually involving people's private data

Please don't be creepy

Thank you

CTO of Hunch

Latent Variables?

Example: sexiness

Certain things are associated with sexiness

Sexiness is really a combination of things

Not totally prescriptive

But we can approximate it pretty well

(diagram: mapper → fan-in → reducer)

Part 3:

Part 2:

Encore: overburdened priors

We have a prior belief

We want an algo to converge

We want to optimize error

That's three jobs being asked of a single prior/penalty term.

Now on Mahout

(an open-source machine learning library that runs on Hadoop)

U=Users

V=Items

Goal

Looking for X:

Picture of SVD: before

Picture of SVD: after

Other issues?

People change over time!

People lie!

Some people are over-represented!

Stupid reason

Because you could just do this:

Send user id, triples, and row for given user into mapper

Have the mapper perform linear regression; mapper output = (user id, new row)

Fan-in does nothing, reducer does nothing

**Start your own Netflix**

Cathy O'Neil

Johnson Research Labs