Radoop - Big Data Analytics with RapidMiner and Hadoop

RCOMM 2013, Porto, Portugal

Zoltan Prekopcsak

on 17 November 2014

2011 June - Radoop prototype presented in Dublin
2011 December - Radoop company founded
2012 July - Radoop 1.0 released
2013 August - Radoop 1.6 released
Radoop lets you work with very large datasets in RapidMiner by storing data and running computations on a Hadoop cluster.
Big Data
"Big data is the term for data sets so large and complex that it becomes difficult to process [...] data processing [...]"
1M rows
4-8 GB
50-100 GB
1 TB
High-Memory Cluster instance: 244 GB RAM, $3.5/hour
Why should I collect and store so much data?
S3 active storage: ~$1/GB/year
Glacier passive storage: ~$0.1/GB/year
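
As a rough worked example, the approximate per-GB prices above can be turned into a yearly cost estimate. This is only a sketch: the function name is made up for illustration, the 100 TB figure is borrowed from the event-log example in this talk, and 1 TB is assumed to be 1000 GB.

```python
# Approximate prices quoted on the slide, in USD per GB per year.
S3_USD_PER_GB_YEAR = 1.0
GLACIER_USD_PER_GB_YEAR = 0.1


def yearly_storage_cost(size_tb: float, usd_per_gb_year: float) -> float:
    """Approximate yearly cost in USD (hypothetical helper, assumes 1 TB = 1000 GB)."""
    return size_tb * 1000 * usd_per_gb_year


if __name__ == "__main__":
    prices = [("S3", S3_USD_PER_GB_YEAR), ("Glacier", GLACIER_USD_PER_GB_YEAR)]
    for name, price in prices:
        cost = yearly_storage_cost(100, price)
        print(f"{name}: ~${cost:,.0f} per year for 100 TB")
```

At these rates, 100 TB comes to roughly $100,000 per year in S3 and roughly $10,000 per year in Glacier.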
Online presentation software
~150 employees, ~27M users, 100TB+ event logs
A/B testing product features, content analysis, user segmentation
R&D company
~50 employees
Collecting detailed web browsing data
Analyzing user behavior for e-commerce
Small US company
3 employees, millions of users, 100GB/day
Creating browser extensions, collecting behavior data
Ad targeting, product recommendations
Internet companies, telcos (CDRs), genetics, manufacturing (sensors), ...
Distributed (MapReduce) operators
Import/Export data to/from Hadoop
Read CSV, Read Database
Write CSV, Write Database
Retrieve/Store/Append to Hive

Data transformations
Select Attributes, Filter Examples
Generate Attributes, Generate ID
Aggregate, Join, Sort
Replace, Replace/Declare Missing Values
Hive/Pig Script
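
To illustrate what "distributed (MapReduce)" means for a transformation such as Aggregate, here is a minimal single-process sketch of a grouped sum split into map, shuffle, and reduce phases. It is not Radoop's implementation; all function names and the row format (a list of dicts per partition) are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Pair = Tuple[Any, float]


def map_phase(rows: Iterable[Row], key: str, value: str) -> List[Pair]:
    """Emit one (group key, value) pair per input row."""
    return [(row[key], float(row[value])) for row in rows]


def shuffle(pairs: Iterable[Pair]) -> Dict[Any, List[float]]:
    """Group emitted values by key, as the framework does between map and reduce."""
    groups: Dict[Any, List[float]] = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)


def reduce_phase(groups: Dict[Any, List[float]]) -> Dict[Any, float]:
    """Combine the values of each group into a single sum."""
    return {k: sum(vs) for k, vs in groups.items()}


def aggregate_sum(partitions: Iterable[Iterable[Row]], key: str, value: str) -> Dict[Any, float]:
    """Grouped sum over partitioned data (hypothetical helper, runs locally)."""
    pairs: List[Pair] = []
    for part in partitions:  # on a cluster, each partition would be mapped on its own node
        pairs.extend(map_phase(part, key, value))
    return reduce_phase(shuffle(pairs))
```

For example, `aggregate_sum([[{"user": "a", "bytes": 10}, {"user": "b", "bytes": 5}], [{"user": "a", "bytes": 7}]], "user", "bytes")` sums `bytes` per `user` across both partitions.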

Machine learning & Statistical modeling
Clustering: K-Means, Fuzzy K-Means, Dirichlet, Canopy
Model learning: Naive Bayes
Model scoring: Naive Bayes, Decision Tree, Logistic Regression, Linear Regression
Evaluation: Performance
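
Clustering algorithms such as K-Means fit the MapReduce model because each iteration reduces to per-partition partial sums that are merged afterwards. The sketch below shows one such iteration in plain Python. It is an illustration under assumptions, not the code behind the Radoop operator: the function names, the tuple-based point format, and the choice to keep empty clusters unchanged are all made up for the example.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Dict[int, Tuple[List[float], int]]  # centroid index -> (coordinate sums, count)


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def map_partition(points: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Map step: assign each point to a centroid and keep only partial sums."""
    partial: Partial = {}
    for p in points:
        idx = nearest(p, centroids)
        sums, count = partial.get(idx, ([0.0] * len(p), 0))
        partial[idx] = ([s + x for s, x in zip(sums, p)], count + 1)
    return partial


def reduce_partials(partials: Sequence[Partial], centroids: Sequence[Point]) -> List[Point]:
    """Reduce step: merge partial sums and compute the updated centroids."""
    totals: Partial = {}
    for partial in partials:
        for idx, (sums, count) in partial.items():
            acc_sums, acc_count = totals.get(idx, ([0.0] * len(sums), 0))
            totals[idx] = ([a + s for a, s in zip(acc_sums, sums)], acc_count + count)
    updated: List[Point] = []
    for idx, old in enumerate(centroids):
        if idx in totals:
            sums, count = totals[idx]
            updated.append(tuple(s / count for s in sums))
        else:
            updated.append(old)  # assumption: a centroid with no points stays where it is
    return updated


def kmeans_step(partitions: Sequence[Sequence[Point]], centroids: Sequence[Point]) -> List[Point]:
    """One K-Means iteration over partitioned data (hypothetical helper, runs locally)."""
    return reduce_partials([map_partition(p, centroids) for p in partitions], centroids)
```

Only the small partial-sum dictionaries need to travel between nodes, not the points themselves, which is what makes the approach practical for large datasets.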
Radoop 2011 vs 2013 by the numbers
Radoop 2011 vs 2013 - subjective
2011: a research prototype that works sometimes
2013: a reliable product that works most of the time
stability, user experience
all RapidMiner goodness: breakpoints, metadata validation, quick fixes
support for all Hadoop distros and different setups
data management perspective
RapidAnalytics integration
Ensemble learning (Random Forest)
Bagging: random sampling with replacement

Subspace method: random feature selection for each node
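
The two sources of randomness named above can be sketched in a few lines of Python. This is a minimal illustration, not Radoop's Random Forest code: the helper names are hypothetical, and tree growing itself is left out.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bootstrap_sample(rows: Sequence[T], rng: random.Random) -> List[T]:
    """Bagging: draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features: Sequence[str], k: int, rng: random.Random) -> List[str]:
    """Subspace method: pick k distinct candidate features for one tree node."""
    return rng.sample(list(features), k)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same draws
    rows = list(range(10))
    features = ["age", "income", "clicks", "visits"]  # made-up attribute names
    print(bootstrap_sample(rows, rng))            # some rows repeat, others are left out
    print(random_feature_subset(features, 2, rng))  # a fresh subset would be drawn per node
```

Each tree in the ensemble would be trained on its own bootstrap sample, and a new feature subset would be drawn at every split. A common heuristic for classification is to set `k` near the square root of the number of features.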

Thank you for your attention!