Radoop - Big Data Analytics with RapidMiner and Hadoop
2011 June - Radoop prototype presented in Dublin
2011 December - Radoop company founded
2012 July - Radoop 1.0 released
2013 August - Radoop 1.6 released
Radoop lets you work with very large datasets in RapidMiner by storing data and running computations on a Hadoop cluster.
"Big data is the term for data sets so large and complex that it becomes
difficult to process
High-Memory Cluster instance
244 GB RAM, $3.5/hour
Why should I collect and store so much data?
S3 active storage: ~$1/GB/year
Glacier passive storage: ~$0.1/GB/year
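To put the two storage prices in perspective, here is a rough back-of-the-envelope sketch. The per-GB prices are the approximate figures from the slide; the 100 TB dataset size, the 1 TB = 1000 GB conversion, and the function name are assumptions chosen only for illustration.

```python
# Approximate prices from the slide, in USD per GB per year.
S3_USD_PER_GB_YEAR = 1.0
GLACIER_USD_PER_GB_YEAR = 0.1


def yearly_storage_cost(size_gb: float, usd_per_gb_year: float) -> float:
    """Return the approximate yearly cost of keeping size_gb in storage."""
    return size_gb * usd_per_gb_year


# Hypothetical dataset: 100 TB, assuming 1 TB = 1000 GB.
size_gb = 100 * 1000

s3_cost = yearly_storage_cost(size_gb, S3_USD_PER_GB_YEAR)
glacier_cost = yearly_storage_cost(size_gb, GLACIER_USD_PER_GB_YEAR)

print(f"S3 (active):       ~${s3_cost:,.0f} per year")
print(f"Glacier (passive): ~${glacier_cost:,.0f} per year")
```

Under these assumptions, passive storage costs about a tenth of active storage, which is the point of the comparison: keeping data around is cheap compared with the value it may have later.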
Online presentation software
~150 employees, ~27M users, 100TB+ event logs
A/B testing product features, content analysis, user segmentation
Collecting detailed web browsing data
Analyzing user behavior for e-commerce
Small US company
3 employees, millions of users, 100GB/day
Creating browser extensions, collecting behavior data
Ad targeting, product recommendations
Internet companies, telcos (CDRs), genetics, manufacturing (sensors), ...
Distributed (MapReduce) operators
Import/Export data to/from Hadoop
Read CSV, Read Database
Write CSV, Write Database
Retrieve/Store/Append to Hive
Select Attributes, Filter Examples
Generate Attributes, Generate ID
Aggregate, Join, Sort
Replace, Replace/Declare Missing Values
Machine learning & Statistical modeling
Clustering: K-Means, Fuzzy K-Means, Dirichlet, Canopy
Model learning: Naive Bayes
Model scoring: Naive Bayes, Decision Tree, Logistic Regression, Linear Regression
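To illustrate how a clustering operator such as K-Means fits the MapReduce model, here is a minimal single-machine sketch of one iteration. It is not Radoop's implementation; the function names are hypothetical, and the grouping loop stands in for the shuffle phase that Hadoop would perform across the cluster.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)
    for point in points:
        key, value = map_assign(point, centroids)
        groups[key].append(value)  # stands in for the shuffle phase
    # A centroid with no assigned points is left where it was.
    return [
        reduce_mean(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)
```

The map step needs only the current centroids and one point at a time, so it can run in parallel on every block of data; only the small per-cluster sums have to travel over the network.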
Radoop 2011 vs 2013 by the numbers
Radoop 2011 vs 2013 - subjective
2011: a research prototype that works sometimes
2013: a reliable product that works all the time
stability, user experience
all RapidMiner goodness: breakpoints, metadata validation, quick fixes
support for all Hadoop distros and different setups
data management perspective
Ensemble learning (Random Forest)
Bagging: random sampling with replacement
Subspace method: random feature selection for each node
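The two sources of randomness named above can be sketched in a few lines. This is only an illustration of the sampling steps, not a full Random Forest and not Radoop's code; the function names and the toy data are hypothetical, and the square-root feature count is a common default assumed here, not something stated in the slides.

```python
import math
import random


def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Subspace method: pick k distinct feature indices to consider at a node."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Toy dataset: 6 rows with 9 features each (values are placeholders).
rows = [[row * 10 + col for col in range(9)] for row in range(6)]
n_features = len(rows[0])
k = int(math.sqrt(n_features))  # assumed default: sqrt of the feature count

sample = bootstrap_sample(rows, rng)      # one training set for one tree
features = random_feature_subset(n_features, k, rng)  # candidates for one node

print(len(sample), features)
```

Each tree in the ensemble would get its own bootstrap sample, and each node in a tree would draw a fresh feature subset before choosing its split.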
Thank you for your attention!