**Hadoop and Mahout**

**The history**

**Google publicized their MapReduce algorithm**

Doug Cutting developed the open-source version of this system, called Hadoop.

Many famous companies support this project

**What is Mahout**

What problems does Mahout solve

Mahout

Recommendation

Classification

Clustering

Who's using Mahout?

Mahout's 3 big pillars (or: what Mahout can do)

Recommendation

Clustering

Classification

Content-based

Collaborative filtering

The recommendation is based on attributes of the items

Article rec.

Music rec.

Possible attributes for recommendation

Content

Journalist

Category

Published date

...

Which ones make sense?

Newspaper format: content

Forum format: ?

Possible attributes for recommendation

Singer

Album

Genre: pop, rock, etc.

...

Which ones make sense?

All of the above

The takeaway: content-based rec. varies from domain to domain, and Mahout doesn't have much to say about it

Based on, and only on, knowledge of users' relationships to items; the items and users themselves are black boxes

Article rec.

Given

1 - Clue 1

2 - Clue 2

Question

What do you recommend for user 3?

Real life problem

The data

Mahout's traditional recommenders

User based

Item based

Slope-one

New and experimental recommenders

SVD (singular value decomposition)

k-NN (k-nearest neighbors)

Tree clustering

User based rec.

The code (sketched below)

Explanation

Data model

Similarity

Neighborhood
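A minimal sketch of how these three pieces fit together in Mahout's Taste API; the file name ratings.csv, the neighborhood size of 10, and the choice of Pearson similarity are illustrative assumptions:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedExample {
  public static void main(String[] args) throws Exception {
    // Data model: each line of ratings.csv is "userID,itemID,preference"
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Similarity: how alike two users' preferences are
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Neighborhood: restrict attention to the 10 most similar users
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 recommendations for user 3
    List<RecommendedItem> items = recommender.recommend(3, 3);
    for (RecommendedItem item : items) {
      System.out.println(item);
    }
  }
}
```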

Efficiently designed for big-data processing

Has various implementations

GenericDataModel: in memory

FileDataModel: refreshable, update-able

Database DataModel

BooleanPreferences
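For instance, a GenericDataModel can be populated entirely in memory; the user and item IDs below are made up for illustration:

```java
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class InMemoryModelExample {
  public static void main(String[] args) throws Exception {
    FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();
    // User 1 rated two items: item 101 -> 4.0, item 102 -> 3.0
    PreferenceArray user1 = new GenericUserPreferenceArray(2);
    user1.setUserID(0, 1L);
    user1.setItemID(0, 101L);
    user1.setValue(0, 4.0f);
    user1.setItemID(1, 102L);
    user1.setValue(1, 3.0f);
    preferences.put(1L, user1);
    DataModel model = new GenericDataModel(preferences);
    System.out.println("Users: " + model.getNumUsers()); // prints "Users: 1"
  }
}
```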

A similarity metric is the notion of sameness between two things, whether they're users or items.

Answers the question: how much do we have in common?

Supported similarity metrics

Pearson correlation–based

Employing weighting

Euclidean distance

Adapting the cosine measure

Spearman correlation

Tanimoto coefficient

Log-likelihood
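All of these metrics implement the same UserSimilarity interface, so they are interchangeable. A fragment, assuming the `model` from the earlier sketch:

```java
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Assumes `model` is a DataModel built as in the earlier sketches
UserSimilarity pearson = new PearsonCorrelationSimilarity(model);
UserSimilarity euclidean = new EuclideanDistanceSimilarity(model);
// Similarity between users 1 and 2, in the range [-1, 1]
System.out.println(pearson.userSimilarity(1L, 2L));
System.out.println(euclidean.userSimilarity(1L, 2L));
```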

The problem with considering every user

Terribly slow

Doesn't add much value

Sometimes adds noise and decreases recommendation quality

Types of neighborhoods

Fixed-size: the top-n most similar users

Threshold-based: users that are >95% similar
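Both neighborhood types are one constructor call in Mahout; a fragment, again assuming `similarity` and `model` from the earlier sketch:

```java
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;

// Fixed-size: the 10 most similar users
UserNeighborhood topN = new NearestNUserNeighborhood(10, similarity, model);
// Threshold-based: only users whose similarity exceeds 0.95
UserNeighborhood atLeast95 = new ThresholdUserNeighborhood(0.95, similarity, model);
```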

Euclidean distance

Implementation is based on the distance between user points in the space of item vectors

Cosine similarity

Similar to Euclidean distance: depends on envisioning user preferences as points in space; the value is calculated from the angle between the two users' vectors

Pearson correlation

Measures the linear correlation (dependence) between two variables X and Y

Correlation between User 1 & User 2

Correlation between User 1 & User 5
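For reference, the standard Pearson correlation between two users' preference series X and Y over their n co-rated items is:

\[
\rho_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
\]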

What does real life data look like?

Addressing some problems with the Pearson correlation

It doesn’t take into account the number of items in which two users’ preferences overlap

If two users overlap on only one item, no correlation can be computed

The correlation is also undefined if either series of preference values is all identical

The solution is to push positive correlation values toward 1.0 and negative values toward –1.0 when the correlation is based on more items

Pearson Correlation similarity w/ weighting
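In Mahout, weighting is a constructor flag; a fragment, assuming `model` as before:

```java
import org.apache.mahout.cf.taste.common.Weighting;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Weighted variant: correlations computed over more co-rated items
// are pushed toward +/-1.0 and trusted more
UserSimilarity weighted = new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);
```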

2 dimensions - 2 items

3 dimensions - 3 items

n dimensions - n items

Spearman correlation

A variant of Pearson correlation: the original preference values are transformed to relative ranks

Tanimoto coefficient

Ignores preference values completely

Is measured by: s = intersection / union

Log-likelihood based

An improvement on the Tanimoto coefficient

Is the natural logarithm of the likelihood function

Likelihood function in a nutshell

Probability vs likelihood

Probability: describes a function of the outcome given a fixed parameter value. For example, if a coin is flipped 100 times and it is a fair coin, what is the probability of it landing heads-up every time?

Likelihood: describes a function of a parameter given an outcome. For example, if a coin is flipped 100 times and it has landed heads-up 100 times, what is the likelihood that the coin is fair?
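As a worked instance of the probability direction:

\[
P(\text{100 heads} \mid \text{fair coin}) = \left(\tfrac{1}{2}\right)^{100} \approx 7.9 \times 10^{-31}
\]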

Unlike the other calculations, even User 1 compared with User 1 doesn't have 100% similarity

Item based rec.

Similarity metrics
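Item-based recommendation plugs these same metrics in as an ItemSimilarity, and needs no neighborhood. A minimal fragment, assuming `model` as before; the choice of log-likelihood similarity is illustrative:

```java
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

// LogLikelihoodSimilarity implements both UserSimilarity and ItemSimilarity
ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
Recommender recommender = new GenericItemBasedRecommender(model, similarity);
// Top 3 recommendations for user 3
System.out.println(recommender.recommend(3, 3));
```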

**The big picture**

**Open Source project**

Writing and running distributed applications that work with big data

**Characteristics**

Accessible

Simple

Robust

Scalable

Without Hadoop

Writing distributed code (manually)

The problems

Store files across many processing machines

Store data on hard disk, not in memory

Partition the intermediate data from Phase 1

Shuffle the partitions to the machines in Phase 2

...

With Hadoop

Code that runs as if on one machine

**What is Hadoop**

Real world example: word count

"Do as I say, not as I do"

The input

Key-value pairs <k1, v1>; in this case <k1, v1> is <filename, document>:

<"_foo", "do as I say, not as I do">

The mapping phase:

Hadoop filters and transforms the input into <k2, v2> pairs

The reducing phase

All pairs sharing the same key k2 are aggregated into <k2, list<v2>>

Aggregates the input from the mappers to produce the desired output <k3, v3>

Foursquare

Yahoo! Mail

Make Machine Learning accessible

Easy to contribute to

The same techniques can be used for item-based recommendation

Slope-one rec.

The big idea

Based on the assumption that there's a linear relationship between items

i2 = m*i1 + b

Another reasonable assumption

m = 1 (slope one)
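For example, if users on average rate i2 two points higher than i1 (an average diff of +2), a user who gave i1 a 3 is predicted to rate i2 at 3 + 2 = 5. In Mahout 0.x's Taste API this was a one-liner, as a fragment assuming `model` as before (SlopeOneRecommender was removed from later Mahout versions):

```java
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.recommender.Recommender;

// Diffs between item pairs are precomputed (and can be built offline);
// prediction just adds the average diff to the user's known ratings
Recommender recommender = new SlopeOneRecommender(model);
System.out.println(recommender.recommend(3, 3));
```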

Pros

Fast

Simple

Effective

Can be calculated offline

Updating the data is simple

Cons

Memory consumption

Evaluation

Hadoop's daemons

HDFS architecture

NameNode

Secondary NameNode

DataNode

Job architecture

JobTracker

TaskTracker

Putting it all together

3 modes

Local (standalone) mode

Pseudo distributed mode

Fully distributed mode

Other experimental recommenders

SVD

k-NN based

Cluster based

Testing & evaluation

The concept

Separate the data into 2 sets:

Use one part for training

Use the other part for testing against the calculated results

Framework support
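Mahout ships an evaluation framework that does this split for you. A sketch, assuming a DataModel `model` loaded as in the earlier sketches; the 70/30 split and the user-based recommender are illustrative choices:

```java
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.common.RandomUtils;

// Make the random train/test split repeatable across runs
RandomUtils.useTestSeed();
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder builder = new RecommenderBuilder() {
  public Recommender buildRecommender(DataModel model) throws TasteException {
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    return new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
};
// Train on 70% of each user's prefs, test on the rest, over 100% of users;
// the score is the average absolute difference between predicted and actual ratings
double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);
System.out.println(score);
```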

Famous last words

The take away