Using R for Data Mining Competitions

An GTA R Users Group talk »
Jonathan Lee

Speed up computation
Bypass memory limitations

Quite easy for independent simulations





Not so simple for non-independent simulations
Using              for 
Data Mining Competitions
Case Studies
Why Data Mining Competitions?
Why R?
Jonathan Lee, University of Western Ontario
GTA R Users Group
November 25, 2011
Ryerson University
Additional Resources
2007
2009
2008
October 2, 2007
Ready-to-use data
Structured goal
No accountability, no stress
Recognition and prizes ($$)
Publication opportunity
Fun!
So Where Are They?
Great for You
A Bit of History
Low-cost way to get innovative solutions
Crowd sourcing
Ownership of solution
Great for Companies
Benefits
open source
cross-platform
powerful
de-facto standard
Prezi (this presentation software)
          http://prezi.com
R-bloggers 
          http://www.r-bloggers.com
Kaggle blog: No Free Hunch
          http://blog.kaggle.com/

R Package Recommendation Engine
Tourism Forecasting
HIV Viral Load Prediction
The Usual Suspects
There's a package for that
Isn't (insert language here) Faster?
Run-time? Perhaps.
Development time? Probably not!

Packages to note: compiler, Rcpp
The               Prize
Not the first data mining competition, but probably the most well-known.
2006
2010
Netflix is an online movie rental service
Can we predict a Netflix user's rating for a movie given previous movie rating data?
The Problem
The Players
The Journey
The Data
The Goal
Data Mining 
Competitions
Competition
Platforms
Individual
Competitions
An incomplete list...
Do better than Netflix's existing system, Cinematch, by at least 10%
 Cinematch uses simple linear models with data conditioning

Grand prize: $1 million USD
Anyone could join


Computer scientists, statisticians, programmers etc.
51051 contestants on 41305 teams from 186 countries. 
Training Set
(~100 million entries)
Quiz set (~1 million entries)
Test set (~1 million entries)
Model tested on quiz set to determine leaderboard ranking
Competition deadline set once 10% improvement is reached
At end of competition, models with at least 10% improvement are tested on test set.
Best performing model on test set is declared the winner.
All data including answers are provided
Models are fit onto this data
There are so many packages that exist for R
Impossible to know all of them -- so which ones should one invest time in learning?
Cinematch for R
The Problem
Any 3 books from the UseR! book list (postage included)
Winner is determined              by highest AUC value               on test set
The Prize
The Data
Voluntary sample of 52 R users
~100,000 rows of data for 1865 packages
Each row contains metadata of package and whether or not a user has that package installed.

PackageName                The name of current R package
DependencyCount         Number of other packages that have it as a dependency
SuggestionCount           Number of other packages that suggest it
ViewsIncluding              Number of views that include it
CorePackage                  Whether or not it is a core package (binary)
RecommendedPackage  Whether or not it is a recommended package (binary)
Maintainer                     Name and e-mail of package maintainer
PackagesMaintaining    How many other packages the maintainer maintains
Training set (2/3)
Test set (1/3)
Add in automatic stepwise model selection...




AUC on test set: 0.9453
A better model
Synthesize better predictor variables

Add a little bit of intuition


AUC: 0.9682
An even better model
A winning model
AUC: Area Under the (ROC) Curve
For a binary classification problem, we can get a confusion matrix.

Receiver Operator Characteristic (ROC) space is defined as the True Positive Rate vs the False Positive Rate.

Our predicted values are not binary, but rather a probability between 0 and 1.
By evaluating the ROC at every classification threshold, we get an ROC curve.
Positive
Negative
Positive
Negative
True Positive
False Positive
True Negative
False Negative
Actual
Predicted
## Fit model on training data
logit.fit <- glm(Installed ~ (LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             ViewsIncluding)^2 +
                             factor(User) -
                             LogSuggestionCount:ViewsIncluding -
                             LogImportCount:ViewsIncluding,
                 data = train.data,
                 family = binomial(link = 'logit'))

#### Predicting the test data
logit.predict <- predict.glm(logit.fit, newdata=test.data, type='response')
If a package is installed, assume all dependencies and imports are installed
Assume all core packages are installed

Logistic regression (stats package) on predictor variables with some variable transformation


AUC on test set: 0.8178
A simple model
## Fit the model to the training data
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage +
                             factor(User),
                 data = train.data,
                 family = binomial(link = 'logit'))

## Use model to predict on the test set
logit.predict <- predict.glm(logit.fit,newdata=test.data,type='response')
Jonathan Lee
E-mail: jlee253@uwo.ca
Website: http://www.stats.uwo.ca/gradwebs/jlee/

This presentation and slides will be made available on my blog: http://www.compmath.com
My Contact
from
IJCNN Social Network Challenge
Predicting edges in an online social network
If you and I are both connected to Bob, what is the probability (p) that I am connected to you?
The Problem
Free registration for two to the IJCNN 2011 Conference and chance to present your solution. ($950 value)
Winner is determined by highest AUC value on test set
The Prize
The Data
1,133,547 nodes (users)

7,246,943 edges (contacts)

No additional covariates given

What social network generates this type of graph?
Directed graph

0.1 % used as test set
A winning model
Number of mutual friends



p=1 if within two degrees of separation
p=0 otherwise
AUC on test set: 0.7529
Two Degrees of Separation
Inbound nodes
1,133,518
Inbound and Outbound nodes
37,689
Outbound nodes
29
for (i in 1:test.size) {
  a <- test$V1[i]
  b <- test$V2[i]

  # Get all nodes that connect to a
  a.list <- c(train$V1[which(train$V2==a)],train$V2[which(train$V1==a)])

  # Get all nodes that connect to b
  b.list <- c(train$V1[which(train$V2==b)],train$V2[which(train$V1==b)])

  # Find number of mutual nodes
  mutual[i] <- sum(a.list %in% b.list)
}

Can't feasibly find 3+ degrees of separation
Let's put this into a graph (igraph package) and find the degrees of separation

Inverse function of degrees of separation?

AUC on test set: 0.832
n Degrees of Separation
library(igraph)
edges <- matrix(c(train$V1,train$V2),ncol=2,byrow=FALSE) 
sngraph <- graph.edgelist(edges,directed=TRUE)
get.path.lengths <- function(v1,v2,graph) length(get.shortest.paths(sngraph,v1,v2,mode="all")[[1]])-1
path.lengths <- mapply(get.path.lengths,test$V1,test$V2,MoreArgs=list(graph=sngraph))
Having a graph opens up a lot of possible covariates



After logistic regression... 
AUC on test set: 0.8704
More covariates
Number of friends
Shortest path
Number of shortest paths etc.
Realized the dataset came from
De-anonymized the users (not an easy task!)
Recreated the missing edges (60%)
Used randomForests on standard social network features to fill in the rest

So why not 100% accuracy?
Social networks evolve over time.
AUC of 0.9812!
How did they do it?
Going Beyond the Data
INFORMS 2011
The Heritage Health Prize
Predict the number of days a patient will spend in the hospital in the next year given historical claims data.

$3M grand prize + $240k progress
Ends April 2013
Big Data
Netflix data - a matrix of about 480,000 members' ratings for about 18,000 movies) was about 65 GB!

Matrix package using sparseMatrix( ) function efficiently stores the data in ~800 MB

irlba package can be used to efficiently do SVD and PCA for large sparse matricies
Data Visualization
Machine Learning
ahaz
arules
BayesTree
Boruta
BPHO
caret
class
CORElearn
CoxBoost
Cubist
e1071 
earth
elasticnet

Model Evaluation
Most packages have built in model evaluation tools
 ROCR: powerful and easy-to-use package for visualizing classifier performance
How can we collaborate?
Code hosting (with version control)
  Github http://www.github.com
  SourceForge http://www.sourceforge.net
  R-forge http://r-forge.r-project.org
RGoogleDocs package

Data Mining is a Team Sport
The Elements of Statistical Learning
        http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Stanford's Machine Learning Course
        http://www.ml-class.org 

Data Mining and 
Machine Learning
require(tm)
require(wordcloud)

data(crude)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, function(x)removeWords(x,stopwords()))
tdm <- TermDocumentMatrix(crude)
v <- sort(rowSums(as.matrix(tdm)),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

wordcloud(d$word,d$freq)
wordcloud
Some examples from R graph gallery: http://addictedtor.free.fr/graphiques/
anaglyph
require(anaglyph)
attach(mtcars)

anaglyph.plot(x=wt,y=mpg,z=disp,
     main="Weight vs Miles Per Gallon vs Displacement",
      xlab="Weight (lb/1000lb)", 
      ylab="Miles/US Gallon")
by Ian Fellows
by Jonathan Lee
RnavGraph
by Adrian Waddell
lattice
require(lattice)

histogram( ~ height | voice.part, data = singer,
          xlab = "Height (inches)", type = "density",
          panel = function(x, ...) {
              panel.histogram(x, ...)
              panel.mathdensity(dmath = dnorm, col = "black",
                                args = list(mean=mean(x),sd=sd(x)))
          } )
by Deepayan Sarkar
ggplot2
require(ggplot2)

ggplot(diamonds, aes(depth, fill = cut)) 
  + geom_density(alpha = 0.2) 
  + xlim(55, 70)
by Hadley Wickham
rgl
by Daniel Adler and Duncan Murdoch
Example: Netflix
Parallel R
 Rmpi
 snow
 multicore
 parallel (default in version 2.14)
Commercial software 
      built on top of R
Optimized for Multi-processor 
      workstations
Multi-threaded Math libraries 
Terabyte file structures and 
      optimized algorithms for "Big Data"

Free for academic users


Revolution R
More info: http://www.revolutionanalytics.com/products/enterprise-big-data.php
library(caret)

model.1 <- rfe(hiv.412[,c(5:617)], as.factor(hiv.412[,618]), 
                         sizes = c(1:10, 15, 30, 60, 90, 120, 150, 200, 250, 300, 350),
                         rfeControl = rfeControl(functions = rfFuncs , method="cv"))
model.1.prediction <- predict(model.1$fit, hiv.692)
Winning Algorithm
library(RandomForest)
 fin.train.1e.5050 <- randomForest(as.factor(Resp) ~ VL.t0  + rt184 + CD4.t0 + rt215 +
    rt98 + rt225 + rt35 + rt151 + rt207 + rt227 + pr12 + pr70 +
    rt101 + pr71 + rt69 + rt75 + rt187 + rt203 + rt41 + pr60 +
    pr20 + rt43 + rt179 + pr82 + pr46 + pr90 + rt31 + pr30 +
    pr10 + rt44 + pr58 + rt121 + pr73 + rt195 + pr63 + rt109 +
    pr54 + rt173 + pr88 + rt205 + rt190 + rt77 + pr13 + rt115 +
   rt106 + pr32 + pr14 + rt170 + rt4 + pr92 + pr35,
na.action=na.omit, data=yellow.5050, mtry=15)

The Problem
Single Hidden Layer NN
Netflix Prize contest announced.
October 8, 2007
Cinematch performance surpassed.
October 15, 2007
1% improvement reached.
November 13, 2007
BellKor awarded $50,000 progress prize 8.43% improvement
Description of algorithm published
December 10, 2008
BellKor in Big Chaos awarded $50,000 progress prize
9.44% improvement
Description of algorithm published
June 26, 2009
BellKor's Pragmatic Chaos reaches 10.05% improvement.
"Last call" for final submissions.
July 27, 2009
BellKor's Pragmatic Chaos wins grand prize 
($1,000,000)
September 21, 2009
Accuracy, error rate, true positive rate, false positive rate, true negative rate, false negative rate, sensitivity, specificity, recall, positive predictive value, negative predictive value, precision, fallout, miss, phi correlation coefficient, Matthews correlation coefficient, mutual information, chi square statistic, odds ratio, lift value, precision/recall F measure, ROC convex hull, area under the ROC curve, precision/recall break-even point, calibration error, mean cross-entropy, root mean squared error, SAR measure, expected cost, explicit cost.
http://rocr.bioinf.mpi-sb.mpg.de/
ROC curves, precision/recall plots, lift charts, cost curves, custom curves by freely selecting one performance measure for the x axis and one for the y axis, handling of data from cross-validation or bootstrapping, curve averaging (vertically, horizontally, or by threshold), standard error bars, box plots, curves that are color-coded by cutoff, printing threshold values on the curve, tight integration with Rs plotting facilities (making it easy to adjust plots or to combine multiple plots), fully customizable, easy to use (only 3 commands).
Supported measures
Features
Check out the MachineLearning Task View on CRAN
ElemStatLearn
gafit
GAMBoost
gbm 
glmnet
glmpath
grplasso
hda
igraph
ipred
kernlab 
klaR




party
penalized
penalizedSVM
predbayescor
quantregForest
randomForest 
randomSurvivalForest
rda
rdetools
REEMtree
relaxo
rgenoud


lars
lasso2
lda
LiblineaR
LogicForest
LogicReg
ltr
mboost 
mvpart
ncvreg
nnet 
pamr



rgp
rminer
ROCR
rpart 
RSNNS
RWeka
sda
SDDA
svmpath
tgp
tree
TWIX
varSelRF
http://cran.r-project.org/web/views/MachineLearning.html
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Max Lin's 2nd place solution: Ensemble Learning with 4 classifiers
Source: https://github.com/m4xl1n/r_recommendation_system
Take training data and create an out-of-sample test set
1
Training Data (80%)
Test (10%)
Test (10%)
2
Train classifiers with training data
3
Use classifiers to predict test data
4
Fit ensemble model using logistic regression on test data (ROCR)
x 10-folds
10
=0.9833
logistic regression (stats)
latent factor model (ltr)
latent Dirichlet allocation on topic (lda)
latent Dirichlet allocation on task view (lda)
logistic regression
(stats)
+ intuition
logistic regression (stats)
latent factor model (ltr)
latent Dirichlet allocation on topic (lda)
latent Dirichlet allocation on task view (lda)
AUC
Hidden Layer
Input Layer
Output Layer
The Problem
Predict stock price movements at 5-minute intervals
Useful for high-frequency traders
Performance measured by AUC
The Data
Time series with 609 blind explanatory variables for an unknown stock
stock prices
sectoral data
economic data
experts' predictions
indexes
Training set: 5922 observations taken at 5-minute intervals.
Test set: 2539 observations taken at 5-minute intervals.
V74CA0
V74HA1
V74CD12
V74LA13
V88CA1
V88H12A12
Logisitic Regression
Feature selection done to select top two significant variables at lag 12


Further significance noticed at lag 1


2 variables * 4 prices * 4 lags = 32 covariates
V74 and V88
Each variable has 4 prices: Open, High, Low, Last

Take lag 0, 1, 12, 13
Model Selection
Manual model selection results in logistic regression model of the form...





Target ~ V74LASTAdv0
             +V74HIGHAdv1
             +V74LASTAdv12
             +V74LASTAdv13
             +V88LASTAdv1
             +V88HIGHAdv12
= 0.9908
Artificial Neural Network
 nnet package to fit a feed-forward neural network with a single hidden layer
Using the covariates from the previous logisitic regression model, we get...
AUC
= 0.9898
AUC
Missing Data
Many features had missing data
Solutions?


Impute it
           impute package (i.e. knn.impute)
Ignore it
Remove it
I liked                , will I like              ?
Cole Harris (Team DejaVu)'s Winning Solution
True Crowdsourcing
Predict the likelihood that an HIV patient's infection will become less severe, given a small dataset and limited clinical information.
Chris Raimondi
 caret package for recursive feature elimination - rfe( )


 RandomForest to fit model
by Arvind Narayanan, Elaine Shi, Ben Rubinstein, and Yong J Kil
Full description: http://blog.kaggle.com/2011/01/15/how-we-did-it-the-winners-of-the-ijcnn-social-network-challenge/
Predicting tourism related time series
Miscellaneous

Loading comments...

Please log in to add your comment.

Report abuse