Machine learning performance experiments with Spark MLlib

Hadoop Summit Europe 2015
by Zoltan Prekopcsak on 3 June 2015


Transcript of Machine learning performance experiments with Spark MLlib

Zoltan Prekopcsak
VP Big Data

Machine Learning Use Cases: Recommendation, Anomaly detection, Clustering, Classification
Machine Learning Tools: MLlib
Data vs Model
The Unreasonable Effectiveness of Data
Halevy, Norvig and Pereira (2009), IEEE Intelligent Systems
Predictive Modeling with Big Data
Junqué de Fortuny, Martens, Provost (2013), Big Data Journal
High-dimensional, sparse data (10M+ values)
zprekopcsak@rapidminer.com
@prekopcsak

Anomalous behavior depends on what is considered normal.
Typical causes and how to handle them:
data collection error: exclude
data handling error: fix
real malfunction or error: alert
Detection can be supervised or unsupervised.
The data set is large, but typically contains few interesting records.
Balanced sampling may help for supervised detection.
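The balanced sampling idea above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the `balanced_sample` helper, the toy records, and the 1% anomaly rate are all assumptions made for the example. It downsamples the majority class until both classes have the same number of records.

```python
import random

def balanced_sample(records, label_of, seed=0):
    """Downsample the majority class so both classes have equal counts.

    `label_of` is a hypothetical callback returning True for anomalies.
    """
    rng = random.Random(seed)
    positives = [r for r in records if label_of(r)]
    negatives = [r for r in records if not label_of(r)]
    # The shorter list is the minority class, the longer one the majority.
    minority, majority = sorted([positives, negatives], key=len)
    kept = rng.sample(majority, len(minority))
    sample = minority + kept
    rng.shuffle(sample)
    return sample

# Toy data (assumed): 1,000 records, 1% of them flagged as anomalies.
records = [{"id": i, "anomaly": i % 100 == 0} for i in range(1000)]
balanced = balanced_sample(records, lambda r: r["anomaly"])
n_anomalies = sum(r["anomaly"] for r in balanced)
print(len(balanced), n_anomalies)
```

A model trained on such a sample sees anomalies and normal records equally often, so its predicted probabilities need recalibrating before they are applied to the original, unbalanced data.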
Figure labels: 100M+ values; cluster size 10K-
Different types of clustering:
centroid-based (e.g. k-means)
distribution-based
hierarchical
density-based
Figure labels: subsample; 100M+ values; 10+ features
Estimating the mean:
A 99% confidence interval of ±2.5% of the standard deviation
needs a sample size of about 10K.
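The sample size figure can be checked with the standard normal-approximation formula n = (z / e)^2, where z is the two-sided critical value for the confidence level and e is the margin of error expressed in standard deviations. The sketch below assumes that formula is what the slide refers to; the function name is made up for this example.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(confidence, margin_in_sd):
    """Records needed so the mean's confidence interval is +/- margin_in_sd
    standard deviations wide, using the normal approximation."""
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z / margin_in_sd) ** 2)

n = sample_size_for_mean(0.99, 0.025)
print(n)
```

The result is a little over 10,600 records, in line with the "~10K" on the slide. Note that it does not depend on the size of the full data set, which is why a subsample can be enough even with 100M+ values.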
Classification and regression
Several algorithms:
Naive Bayes
Decision Tree
Random Forest
Figure labels: model; 100M+ values; learn; curve
Experiment with MLlib Logistic Regression
Flight data set: 120M+ records, 6 numerical attributes
Target: predicting whether a flight is 5+ minutes late (about 30% are late)
How does the prediction change with less training data?
100% training: gold standard
50% training: 1 diff in every 70,000
30% training: 1 diff in every 50,000
1% training (1.2M records): 1 diff in every 20,000 (99.995% same)
Notes
Just a single experiment, other data sets may be different
In general, low-dimensional classification rarely needs large-scale ML
The more model parameters, the more differences
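The comparison in this experiment (train on everything, train on a subsample, count how many predictions differ) can be sketched as follows. This is not the talk's Spark MLlib code: it is a small pure-Python logistic regression on synthetic data, and the data generator, the 2,000-record size, the 30% fraction, and the hyperparameters are all assumptions chosen so the example runs in about a second.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(w, x):
    # w[0] is the intercept, w[1:] are the feature weights.
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def train_logreg(rows, labels, epochs=100, lr=0.5):
    """Plain batch gradient descent on the logistic loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(score(w, x)) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if score(w, x) > 0 else 0

# Synthetic stand-in for the flight data (assumed): 2 numerical attributes.
rng = random.Random(42)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
labels = [1 if rng.random() < sigmoid(1.5 * a - 1.0 * b - 0.8) else 0
          for a, b in rows]

# "Gold standard" model on 100% of the data, second model on a 30% sample.
gold = train_logreg(rows, labels)
idx = rng.sample(range(len(rows)), int(0.3 * len(rows)))
small = train_logreg([rows[i] for i in idx], [labels[i] for i in idx])

# Count records on which the two models disagree.
diffs = sum(predict(gold, x) != predict(small, x) for x in rows)
agreement = 1 - diffs / len(rows)
print(f"{diffs} diffs in {len(rows)} predictions, {agreement:.2%} same")
```

With only a few thousand records the agreement will be well below the 99.995% reported on the slide; the point of the sketch is the measurement itself, which compares the two models' predictions with each other rather than with the true labels.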
Experiment with K-Means clustering
Generated clustering data: 10M records, 5 numerical attributes
Goal: identify 5 different clusters
How does the clustering change with less training data?
Notes
Just a single experiment, other data sets may be different
In general, low-dimensional clustering does not need large-scale execution
The more dimensions, the more differences
100% training: gold standard
Compared against: 50% training, 30% training, 1% training
Differences reported (in transcript order): 1 diff in every 2,000,000; 1 diff in every 50,000; 1 diff in every 400,000
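The same kind of comparison for clustering needs one extra step, because cluster numbering is arbitrary: clusters found on the subsample have to be matched to the gold-standard clusters before assignments are compared. The sketch below is a minimal pure-Python illustration under stated assumptions (three well-separated 2-D clusters, 900 points, a 10% sample, and a deterministic farthest-point initialization picked for reproducibility); it is not the talk's MLlib setup.

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(centers, p):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: dist2(centers[i], p))

def kmeans(points, k, iterations=10):
    # Farthest-point initialization (an assumption of this sketch).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers

# Generated data (assumed): 3 clusters of 300 points each.
rng = random.Random(7)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in true_centers for _ in range(300)]
rng.shuffle(points)

full_centers = kmeans(points, 3)                      # gold standard
sample_centers = kmeans(rng.sample(points, 90), 3)    # 10% subsample

# Map each sample cluster to the gold-standard cluster with the nearest center.
relabel = [nearest(full_centers, c) for c in sample_centers]
diffs = sum(nearest(full_centers, p) != relabel[nearest(sample_centers, p)]
            for p in points)
agreement = 1 - diffs / len(points)
print(f"{diffs} diffs in {len(points)} records, {agreement:.3%} same")
```

Matching by nearest center only works when the two clusterings are similar; for less clean data a proper assignment step (for example, the Hungarian algorithm on center distances) would be a safer choice.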