Hadoop and Mahout

by Chien Vu Quang on 8 July 2014


Transcript of Hadoop and Mahout

Hadoop and Mahout
The history
Google published its MapReduce algorithm
Doug Cutting developed the open source version of this system called Hadoop.
Many famous companies support this project

What is Mahout
What problems does Mahout solve
Mahout
Recommendation
Classification
Clustering
Who's using Mahout
Mahout's 3 big pillars (or what can Mahout do)
Recommendation
Clustering
Classification
Content-based
Collaborative filtering
The recommendation is based on attributes of item
Articles rec.
Music rec.
Possible attributes for recommendation
Content
Journalist
Category
Published date
...
Which ones make sense
Newspaper format : content
Forum format: ?
Possible attributes for recommendation
Singer
Album
Genre: pop, rock, etc.
...
Which ones make sense
All of the above
The takeaway: content-based rec. varies from domain to domain, and Mahout doesn't have much to say about it
Based on, and only on, knowledge of users’ relationships to items. The items and users are black boxes
Article rec.
Given
2 - Clue 2
1 - Clue 1
Question
What do you recommend for user 3 ?
Real life problem
The data
Mahout's traditional recommenders
User based
Item based
Slope-one
New and experimental recommenders
SVD (singular value decomposition)
Knn (k nearest neighbors)
Tree clustering
User based rec.
The code
Explanation
Data model
Similarity
Neighborhood
Efficiently designed for big data processing
Has various implementations
GenericDataModel: in memory
FileDataModel: refreshable, updatable
Database DataModel
BooleanPreferences
Similarity metric is the notion of sameness between two things, whether they’re users or items.
Answers the question: how much do we have in common?
Supported similarity metrics
Pearson correlation–based
Employing weighting
Euclidean distance
Adapting the cosine measure
Spearman correlation
Tanimoto coefficient
Log-likelihood
The problem w/ considering every user
Terribly slow
Doesn't add much value
Sometimes adds noise and lowers the recommendation quality
Types of neighborhoods
Fixed size: the top n similar users
Threshold-based: users that are >95% similar
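The two neighborhood types above can be sketched in a few lines. This is a minimal illustration in plain Python, not Mahout's API: the function names and the similarity scores for user 1 are made up for the example.

```python
def fixed_size_neighborhood(similarities, n):
    """Return the n user ids most similar to the target user."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]


def threshold_neighborhood(similarities, threshold):
    """Return every user id whose similarity is above the threshold."""
    return [user for user, score in similarities.items() if score > threshold]


# Hypothetical similarity of users 2..5 to user 1 (illustrative numbers only).
sims = {2: 0.98, 3: 0.40, 4: 0.96, 5: -0.70}

top_two = fixed_size_neighborhood(sims, 2)
very_similar = threshold_neighborhood(sims, 0.95)
print(top_two, very_similar)
```

With a fixed size the neighborhood always has n members, even if some are poor matches; with a threshold it can be empty when nobody is similar enough.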
Euclidean distance
Cosine similarity
Measures the linear correlation (dependence) between two variables X and Y
The implementation is based on the distance between users, seen as points in a space whose dimensions are items
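A distance-based similarity along these lines might look as follows. This is a sketch only: turning the distance d into a similarity with 1 / (1 + d) is an assumption made for the example, and Mahout's own formula may differ. The preference dictionaries are hypothetical.

```python
import math


def euclidean_similarity(prefs_a, prefs_b):
    """Similarity of two users from the distance between their preference points.

    prefs_a, prefs_b: dicts mapping item id -> preference value.
    Only items both users have rated are used as dimensions.
    """
    common = set(prefs_a) & set(prefs_b)
    if not common:
        return None  # no shared items, nothing to compare
    distance = math.sqrt(sum((prefs_a[i] - prefs_b[i]) ** 2 for i in common))
    # Assumed mapping: distance 0 gives similarity 1, larger distances approach 0.
    return 1.0 / (1.0 + distance)


user_1 = {101: 5.0, 102: 3.0}
user_2 = {101: 2.0, 102: 7.0}
print(euclidean_similarity(user_1, user_2))
```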
Pearson correlation
Correlation between User 1 & User 2
Correlation between User 1 & User 5
What does real life data look like?
Address some problems with Pearson correlation
It doesn’t take into account the number of items in which two users’ preferences overlap
It can't be computed if two users overlap on only one item
The correlation is also undefined if either series of preference values is all identical
The solution is to push positive correlation values toward 1.0 and negative values toward –1.0 when the correlation is based on more items
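The problem cases are easy to see in code. Below is a minimal plain-Python sketch of the unweighted Pearson correlation over the items two users share; it returns None where the value is undefined. The weighting step is not implemented, and the function name and sample data are hypothetical.

```python
import math


def pearson(prefs_a, prefs_b):
    """Pearson correlation over the items both users have rated, or None."""
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None  # zero or one overlapping item: undefined
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # one series is constant: undefined
    return cov / math.sqrt(var_x * var_y)


user_a = {101: 1.0, 102: 2.0, 103: 3.0}
user_b = {101: 2.0, 102: 4.0, 103: 6.0}
print(pearson(user_a, user_b))
```

When only two items overlap and neither series is constant, the result is always exactly 1 or -1, which is why the number of overlapping items matters.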
Pearson Correlation similarity w/ weighting
3 dimensions - 3 items
n dimensions
2 dimensions - 2 items
Spearman correlation
Tanimoto coefficient
Log-likelihood based
Item based rec.
Similarity metrics
Similar to Euclidean distance: depends on envisioning user preferences as points in space
The value is calculated based on the angle between two users' vectors
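As a sketch, the cosine of the angle between two preference vectors can be computed like this. Restricting the vectors to items both users have rated is an assumption of this example, and the names are hypothetical.

```python
import math


def cosine_similarity(prefs_a, prefs_b):
    """Cosine of the angle between two users' preference vectors, or None."""
    common = set(prefs_a) & set(prefs_b)
    dot = sum(prefs_a[i] * prefs_b[i] for i in common)
    norm_a = math.sqrt(sum(prefs_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(prefs_b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return None  # a zero-length vector has no direction
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity({1: 1.0, 2: 2.0}, {1: 2.0, 2: 4.0})
right_angle = cosine_similarity({1: 1.0, 2: 0.0}, {1: 0.0, 2: 1.0})
print(same_direction, right_angle)
```

Vectors pointing the same way score close to 1 regardless of their length; vectors at a right angle score 0.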
A variant of Pearson correlation: the original preference values are replaced by their relative ranks
Ignore preference values completely
Is measured by:
s = size of intersection / size of union
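A minimal sketch of that ratio, using plain Python sets of item ids. The item ids are made up, and "total" is read here as the union of both users' items.

```python
def tanimoto(items_a, items_b):
    """Tanimoto coefficient: shared items divided by all items either user has."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return None  # neither user has any items
    return len(a & b) / len(union)


# Preference values are ignored; only which items each user touched matters.
score = tanimoto({101, 102, 103}, {102, 103, 104, 105})
print(score)
```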
An improvement to Tanimoto
Is the natural logarithm of the likelihood function
Likelihood function in a nutshell
Probability
Likelihood
Probability vs Likelihood
Describes a function of the outcome given a fixed parameter value.

For example, if a coin is flipped 100 times and it is a fair coin, what is the probability of it landing heads-up every time?
Describes a function of a parameter given a fixed outcome.

For example, if a coin is flipped 100 times and it has landed heads-up 100 times, what is the likelihood that the coin is fair?
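The coin example can be worked through with the binomial formula. This is an illustration of the probability/likelihood distinction only, not of Mahout's log-likelihood similarity; the helper names are hypothetical.

```python
import math


def binomial_pmf(heads, flips, p):
    """Probability of exactly `heads` heads in `flips` flips of a coin with P(heads) = p."""
    return math.comb(flips, heads) * p ** heads * (1 - p) ** (flips - heads)


# Probability: the parameter is fixed (fair coin, p = 0.5), the outcome is the question.
prob_all_heads = binomial_pmf(100, 100, 0.5)


# Likelihood: the outcome is fixed (100 heads out of 100), the parameter varies.
def likelihood(p):
    return binomial_pmf(100, 100, p)


def log_likelihood(p):
    return math.log(likelihood(p))


print(prob_all_heads)
print(likelihood(0.5), likelihood(0.9), likelihood(1.0))
print(log_likelihood(0.5))
```

Both questions use the same formula. Given 100 heads, a coin biased toward heads (p = 0.9) is far more likely than a fair one, and p = 1.0 maximizes the likelihood.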
Unlike the other calculations, even User 1 compared with User 1 doesn't get 100% similarity
The big picture
Open Source project
For writing and running distributed applications that work with big data

Accessible
Simple
Robust
Scalable
Characteristics
Without Hadoop
With Hadoop
Code that runs on one machine
Writing distributed code (manually)
The problems
Store files across many processing machines
Store data on hard disk, not in memory
Partition the intermediate data from Phase 1
Shuffle the partitions to machines in Phase 2
...
The mapping phase:
What is Hadoop
Real world example
word count
"Do as I say, not as I do"
Hadoop filters and transforms the input into <k2, v2>
The reducing phase
The input
Key value pair <k1,v1>
All values sharing the same key k2 are aggregated into <k2, list(v2)>
Aggregates the input from the mappers to produce the desired output <k3, v3>
In this case it's <filename, document>:
<"_foo", "do as I say, not as I do">
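The word count flow can be simulated in a single process to show how the key/value pairs move through the phases. This is a plain-Python sketch, not Hadoop's API: the function names are hypothetical, and the shuffle that Hadoop performs across machines is reduced to a dictionary here.

```python
from collections import defaultdict


def map_phase(filename, document):
    """Mapper: <k1, v1> = <filename, document> -> list of <k2, v2> = <word, 1>."""
    words = document.lower().replace(",", " ").split()
    return [(word, 1) for word in words]


def shuffle(pairs):
    """Group values by key: <k2, v2> pairs -> <k2, list(v2)>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, ones):
    """Reducer: <k2, list(v2)> -> <k3, v3> = <word, count>."""
    return word, sum(ones)


pairs = map_phase("_foo", "Do as I say, not as I do")
counts = dict(reduce_phase(word, ones) for word, ones in shuffle(pairs).items())
print(counts)
```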
Foursquare
Yahoo! Mail
Make Machine Learning accessible
Contribute-able
The same techniques can be used for Item based recommendation
Slope one rec.
The big idea
Based on the assumption that there's a linear relationship between the preference values of items
i2 = m*i1 + b
Another reasonable assumption
m = 1 (slope one)
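With m = 1, only the offset b between each pair of items has to be learned: it is the average difference in preference over the users who rated both. The sketch below weights each offset by how many users support it (the "weighted slope one" variant, chosen here as an assumption). The ratings and function names are hypothetical.

```python
def average_differences(ratings):
    """ratings: {user: {item: value}} -> {(i, j): (average of r_j - r_i, user count)}."""
    sums, counts = {}, {}
    for prefs in ratings.values():
        for i, r_i in prefs.items():
            for j, r_j in prefs.items():
                if i == j:
                    continue
                sums[(i, j)] = sums.get((i, j), 0.0) + (r_j - r_i)
                counts[(i, j)] = counts.get((i, j), 0) + 1
    return {pair: (sums[pair] / counts[pair], counts[pair]) for pair in sums}


def predict(ratings, diffs, user, target):
    """Estimate the user's preference for target as r_i + b, averaged over rated items i."""
    numerator, denominator = 0.0, 0
    for i, r_i in ratings[user].items():
        if (i, target) in diffs:
            b, n = diffs[(i, target)]
            numerator += (r_i + b) * n  # weight by the number of supporting users
            denominator += n
    return numerator / denominator if denominator else None


ratings = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 3.0, "B": 2.0},
    "u3": {"A": 4.0},
}
diffs = average_differences(ratings)
print(predict(ratings, diffs, "u3", "B"))
```

The table of differences can be computed offline, but it holds an entry for every co-rated pair of items, which is where the memory consumption comes from.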
Pros
Cons
Evaluation
Fast
Simplicity
Effective
Can be calculated offline
Updating data is simple
Memory consumption
Hadoop's daemons
HDFS architecture
Job architecture
NameNode
Secondary NameNode
DataNode
JobTracker
TaskTracker
Putting it all together
3 modes
Local (standalone) mode
Pseudo-distributed mode
Fully distributed mode
Other experimental recommenders
SVD
Knn based
Cluster based
Testing & evaluation
The concept
Separate the data into 2 sets:
Use one part for training
The other part for testing against the calculated result
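The split-and-compare idea can be sketched as follows. This is not Mahout's evaluator: scoring with mean absolute error, the 75/25 split, the synthetic preferences and the global-mean predictor are all assumptions made for the example.

```python
import random


def split_preferences(prefs, training_fraction, seed=0):
    """Shuffle (user, item, value) triples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(prefs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(predict, test_set):
    """Average absolute gap between predicted and actual preference values."""
    errors = [abs(predict(user, item) - actual) for user, item, actual in test_set]
    return sum(errors) / len(errors)


# Synthetic preferences: 5 users x 4 items, values between 1.0 and 5.0.
prefs = [(u, i, float((u + i) % 5 + 1)) for u in range(1, 6) for i in range(101, 105)]
train, test = split_preferences(prefs, 0.75)

# Placeholder recommender: always predicts the mean of the training values.
global_mean = sum(value for _, _, value in train) / len(train)
mae = mean_absolute_error(lambda user, item: global_mean, test)
print(len(train), len(test), mae)
```

A lower error on the held-out part means the calculated results are closer to what users actually expressed.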
Framework support
Famous last words
The takeaway