Radoop - Big Data Analytics with RapidMiner and Hadoop

RCOMM 2013, Porto, Portugal

Zoltan Prekopcsak

on 17 November 2014

2011 June - Radoop prototype presented in Dublin
2011 December - Radoop company founded
2012 July - Radoop 1.0 released
2013 August - Radoop 1.6 released
Radoop lets you work with very large datasets in RapidMiner by storing data and running computations on a Hadoop cluster
Big Data
"Big data is the term for data sets so large and complex that it becomes difficult to process [...] data processing"
1M rows
4-8 GB
50-100 GB
1 TB
High-Memory Cluster instance
244 GB RAM, $3.5/hour
Why should I collect and store so much data?
S3 active storage: ~$1/GB/year
Glacier passive storage: ~$0.1/GB/year
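
To put the two storage prices in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: the per-GB prices are the approximate figures from the slide, the dataset sizes (1 TB and 100 TB) are picked for illustration, and the helper name yearly_storage_cost is hypothetical.

```python
# Approximate prices quoted on the slide (USD per GB per year).
S3_USD_PER_GB_YEAR = 1.0
GLACIER_USD_PER_GB_YEAR = 0.1

# Assumption: decimal terabytes (1 TB = 1000 GB) to keep the numbers round.
GB_PER_TB = 1000


def yearly_storage_cost(size_tb: float, usd_per_gb_year: float) -> float:
    """Return the approximate yearly cost in USD of keeping size_tb terabytes."""
    return size_tb * GB_PER_TB * usd_per_gb_year


if __name__ == "__main__":
    # Illustrative dataset sizes, not measurements.
    for size_tb in (1, 100):
        s3 = yearly_storage_cost(size_tb, S3_USD_PER_GB_YEAR)
        glacier = yearly_storage_cost(size_tb, GLACIER_USD_PER_GB_YEAR)
        print(f"{size_tb} TB: S3 ~${s3:,.0f}/year, Glacier ~${glacier:,.0f}/year")
```

At these rates, keeping raw logs around is cheap compared with the cost of not having the data when a new question comes up.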
~150 employees, ~27M users, 100TB+ event logs
A/B testing product features, content analysis, user segmentation
R&D company
~50 employees
Collecting detailed web browsing data
Analyzing user behavior for e-commerce
Small US company
3 employees, millions of users, 100GB/day
Creating browser extensions, collecting behavior data
Ad targeting, product recommendations
Internet companies, telcos (CDRs), genetics, manufacturing (sensors), ...
Distributed (MapReduce) operators
Import/Export data to/from Hadoop
Read CSV, Read Database
Write CSV, Write Database
Retrieve/Store/Append to Hive

Data transformations
Select Attributes, Filter Examples
Generate Attributes, Generate ID
Aggregate, Join, Sort
Replace, Replace/Declare Missing Values
Hive/Pig Script

Machine learning & Statistical modeling
Clustering: K-Means, Fuzzy K-Means, Dirichlet, Canopy
Model learning: Naive Bayes
Model scoring: Naive Bayes, Decision Tree, Logistic Regression, Linear Regression
Evaluation: Performance
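
To illustrate what "distributed (MapReduce)" means for an operator such as Aggregate, here is a minimal single-machine sketch of the map, shuffle, and reduce steps. It is not Radoop's implementation: the function names, the row format (a list of dicts), and the attribute names in the usage note are all assumptions made for the example. On a real cluster the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def map_phase(rows: Iterable[Row], key_attr: str, value_attr: str) -> Iterator[Tuple[object, float]]:
    """Map: emit one (group key, value) pair per input row."""
    for row in rows:
        yield row[key_attr], float(row[value_attr])


def shuffle(pairs: Iterable[Tuple[object, float]]) -> Dict[object, List[float]]:
    """Shuffle: collect all values that share the same key."""
    groups: Dict[object, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[object, List[float]]) -> Dict[object, Dict[str, float]]:
    """Reduce: compute count and sum independently for each key."""
    return {key: {"count": len(values), "sum": sum(values)} for key, values in groups.items()}


def aggregate(rows: Iterable[Row], key_attr: str, value_attr: str) -> Dict[object, Dict[str, float]]:
    """Group rows by key_attr and aggregate value_attr (count and sum)."""
    return reduce_phase(shuffle(map_phase(rows, key_attr, value_attr)))
```

For example, calling aggregate(rows, "user", "spend") on a list of rows with hypothetical "user" and "spend" attributes returns one count and one sum per user. Because each key is reduced independently, the work can be split across machines.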
Radoop 2011 vs 2013 by the numbers
Radoop 2011 vs 2013 - subjective
2011: a research prototype that works sometimes
2013: a reliable product that works all (or rather, most of) the time
stability, user experience
all RapidMiner goodness: breakpoints, metadata validation, quick fixes
support for all Hadoop distros and different setups
data management perspective
RapidAnalytics integration
Ensemble learning (Random Forest)
Bagging: random sampling with replacement

Subspace method: random feature selection for each node
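
The two sources of randomness named above can be sketched in a few lines. This is only an illustration of the sampling steps, not Radoop's Random Forest code: the function names are hypothetical, and the default of choosing about sqrt(n_features) features per node is a common convention assumed here, not something stated on the slide.

```python
import math
import random
from typing import List, Optional


def bootstrap_sample(n_rows: int, rng: random.Random) -> List[int]:
    """Bagging: draw n_rows row indices at random WITH replacement.

    Some rows appear several times and others not at all, so each tree
    in the ensemble is trained on a slightly different dataset.
    """
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_subspace(n_features: int, rng: random.Random, k: Optional[int] = None) -> List[int]:
    """Subspace method: pick k distinct feature indices for one tree node.

    Assumption: when k is not given, use floor(sqrt(n_features)), at least 1.
    """
    if k is None:
        k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    print("rows for one tree:", bootstrap_sample(10, rng))
    print("features for one node:", random_subspace(16, rng))
```

Each tree gets its own bootstrap sample, and each node inside a tree draws a fresh feature subset before choosing its split, which keeps the trees in the ensemble from being too similar.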

Thank you for your attention!
