Speed up computation

Bypass memory limitations

Quite easy for independent simulations

Not so simple for non-independent simulations Using for

Data Mining Competitions Case Studies Why Data Mining Competitions? Why R? Jonathan Lee, University of Western Ontario

GTA R Users Group

November 25, 2011

Ryerson University Additional Resources 2007 2009 2008 October 2, 2007 Ready-to-use data

Structured goal

No accountability, no stress

Recognition and prizes ($$)

Publication opportunity

Fun! So Where Are They? Great for You A Bit of History Low-cost way to get innovative solutions

Crowd sourcing

Ownership of solution Great for Companies Benefits open source

cross-platform

powerful

de-facto standard Prezi (this presentation software)

http://prezi.com

R-bloggers

http://www.r-bloggers.com

Kaggle blog: No Free Hunch

http://blog.kaggle.com/ R Package Recommendation Engine Tourism Forecasting HIV Viral Load Prediction The Usual Suspects There's a package for that Isn't (insert language here) Faster? Run-time? Perhaps.

Development time? Probably not!

Packages to note: compiler, Rcpp The Prize Not the first data mining competition, but probably the most well-known. 2006 2010 Netflix is an online movie rental service

Can we predict a Netflix user's rating for a movie given previous movie rating data? The Problem The Players The Journey The Data The Goal Data Mining

Competitions Competition

Platforms Individual

Competitions An incomplete list... Do better than Netflix's existing system, Cinematch, by at least 10%

Cinematch uses simple linear models with data conditioning

Grand prize: $1 million USD Anyone could join

Computer scientists, statisticians, programmers etc. 51051 contestants on 41305 teams from 186 countries. Training Set

(~100 million entries) Quiz set (~1 million entries) Test set (~1 million entries) Model tested on quiz set to determine leaderboard ranking

Competition deadline set once 10% improvement is reached At end of competition, models with at least 10% improvement are tested on test set.

Best performing model on test set is declared the winner. All data including answers are provided

Models are fit onto this data There are so many packages that exist for R

Impossible to know all of them -- so which ones should one invest time in learning?

Cinematch for R The Problem Any 3 books from the UseR! book list (postage included)

Winner is determined by highest AUC value on test set The Prize The Data Voluntary sample of 52 R users

~100,000 rows of data for 1865 packages

Each row contains metadata of package and whether or not a user has that package installed. PackageName The name of current R package

DependencyCount Number of other packages that have it as a dependency

SuggestionCount Number of other packages that suggest it

ViewsIncluding Number of views that include it

CorePackage Whether or not it is a core package (binary)

RecommendedPackage Whether or not it is a recommended package (binary)

Maintainer Name and e-mail of package maintainer

PackagesMaintaining How many other packages the maintainer maintains Training set (2/3) Test set (1/3) Add in automatic stepwise model selection...

AUC on test set: 0.9453 A better model Synthesize better predictor variables

Add a little bit of intuition

AUC: 0.9682 An even better model A winning model AUC: Area Under the (ROC) Curve For a binary classification problem, we can get a confusion matrix.

Receiver Operator Characteristic (ROC) space is defined as the True Positive Rate vs the False Positive Rate.

Our predicted values are not binary, but rather a probability between 0 and 1.

By evaluating the ROC at every classification threshold, we get an ROC curve. Positive Negative Positive Negative True Positive False Positive True Negative False Negative Actual Predicted ## Fit model on training data

logit.fit <- glm(Installed ~ (LogDependencyCount +

LogSuggestionCount +

LogImportCount +

ViewsIncluding)^2 +

factor(User) -

LogSuggestionCount:ViewsIncluding -

LogImportCount:ViewsIncluding,

data = train.data,

family = binomial(link = 'logit'))

#### Predicting the test data

logit.predict <- predict.glm(logit.fit, newdata=test.data, type='response') If a package is installed, assume all dependencies and imports are installed

Assume all core packages are installed Logistic regression (stats package) on predictor variables with some variable transformation

AUC on test set: 0.8178 A simple model ## Fit the model to the training data

logit.fit <- glm(Installed ~ LogDependencyCount +

LogSuggestionCount +

LogImportCount +

LogViewsIncluding +

LogPackagesMaintaining +

CorePackage +

RecommendedPackage +

factor(User),

data = train.data,

family = binomial(link = 'logit'))

## Use model to predict on the test set

logit.predict <- predict.glm(logit.fit,newdata=test.data,type='response') Jonathan Lee

E-mail: jlee253@uwo.ca

Website: http://www.stats.uwo.ca/gradwebs/jlee/

This presentation and slides will be made available on my blog: http://www.compmath.com My Contact from IJCNN Social Network Challenge Predicting edges in an online social network

If you and I are both connected to Bob, what is the probability (p) that I am connected to you? The Problem Free registration for two to the IJCNN 2011 Conference and chance to present your solution. ($950 value)

Winner is determined by highest AUC value on test set The Prize The Data 1,133,547 nodes (users)

7,246,943 edges (contacts)

No additional covariates given

What social network generates this type of graph? Directed graph

0.1 % used as test set A winning model Number of mutual friends

p=1 if within two degrees of separation

p=0 otherwise

AUC on test set: 0.7529 Two Degrees of Separation Inbound nodes

1,133,518 Inbound and Outbound nodes

37,689 Outbound nodes

29 for (i in 1:test.size) {

a <- test$V1[i]

b <- test$V2[i]

# Get all nodes that connect to a

a.list <- c(train$V1[which(train$V2==a)],train$V2[which(train$V1==a)])

# Get all nodes that connect to b

b.list <- c(train$V1[which(train$V2==b)],train$V2[which(train$V1==b)])

# Find number of mutual nodes

mutual[i] <- sum(a.list %in% b.list)

} Can't feasibly find 3+ degrees of separation

Let's put this into a graph (igraph package) and find the degrees of separation

Inverse function of degrees of separation?

AUC on test set: 0.832 n Degrees of Separation library(igraph)

edges <- matrix(c(train$V1,train$V2),ncol=2,byrow=FALSE)

sngraph <- graph.edgelist(edges,directed=TRUE)

get.path.lengths <- function(v1,v2,graph) length(get.shortest.paths(sngraph,v1,v2,mode="all")[[1]])-1

path.lengths <- mapply(get.path.lengths,test$V1,test$V2,MoreArgs=list(graph=sngraph)) Having a graph opens up a lot of possible covariates

After logistic regression...

AUC on test set: 0.8704 More covariates Number of friends

Shortest path

Number of shortest paths etc. Realized the dataset came from

De-anonymized the users (not an easy task!)

Recreated the missing edges (60%)

Used randomForests on standard social network features to fill in the rest

So why not 100% accuracy?

Social networks evolve over time. AUC of 0.9812!

How did they do it? Going Beyond the Data INFORMS 2011 The Heritage Health Prize Predict the number of days a patient will spend in the hospital in the next year given historical claims data.

$3M grand prize + $240k progress

Ends April 2013 Big Data Netflix data - a matrix of about 480,000 members' ratings for about 18,000 movies) was about 65 GB!

Matrix package using sparseMatrix( ) function efficiently stores the data in ~800 MB

irlba package can be used to efficiently do SVD and PCA for large sparse matricies Data Visualization Machine Learning ahaz

arules

BayesTree

Boruta

BPHO

caret

class

CORElearn

CoxBoost

Cubist

e1071

earth

elasticnet Model Evaluation Most packages have built in model evaluation tools

ROCR: powerful and easy-to-use package for visualizing classifier performance How can we collaborate?

Code hosting (with version control)

Github http://www.github.com

SourceForge http://www.sourceforge.net

R-forge http://r-forge.r-project.org

RGoogleDocs package Data Mining is a Team Sport The Elements of Statistical Learning

http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Stanford's Machine Learning Course

http://www.ml-class.org Data Mining and

Machine Learning require(tm)

require(wordcloud)

data(crude)

crude <- tm_map(crude, removePunctuation)

crude <- tm_map(crude, function(x)removeWords(x,stopwords()))

tdm <- TermDocumentMatrix(crude)

v <- sort(rowSums(as.matrix(tdm)),decreasing=TRUE)

d <- data.frame(word = names(v),freq=v)

wordcloud(d$word,d$freq) wordcloud Some examples from R graph gallery: http://addictedtor.free.fr/graphiques/ anaglyph require(anaglyph)

attach(mtcars)

anaglyph.plot(x=wt,y=mpg,z=disp,

main="Weight vs Miles Per Gallon vs Displacement",

xlab="Weight (lb/1000lb)",

ylab="Miles/US Gallon") by Ian Fellows by Jonathan Lee RnavGraph by Adrian Waddell lattice require(lattice)

histogram( ~ height | voice.part, data = singer,

xlab = "Height (inches)", type = "density",

panel = function(x, ...) {

panel.histogram(x, ...)

panel.mathdensity(dmath = dnorm, col = "black",

args = list(mean=mean(x),sd=sd(x)))

} ) by Deepayan Sarkar ggplot2 require(ggplot2)

ggplot(diamonds, aes(depth, fill = cut))

+ geom_density(alpha = 0.2)

+ xlim(55, 70) by Hadley Wickham rgl by Daniel Adler and Duncan Murdoch Example: Netflix Parallel R Rmpi

snow

multicore

parallel (default in version 2.14) Commercial software

built on top of R

Optimized for Multi-processor

workstations

Multi-threaded Math libraries

Terabyte file structures and

optimized algorithms for "Big Data"

Free for academic users Revolution R More info: http://www.revolutionanalytics.com/products/enterprise-big-data.php library(caret)

model.1 <- rfe(hiv.412[,c(5:617)], as.factor(hiv.412[,618]),

sizes = c(1:10, 15, 30, 60, 90, 120, 150, 200, 250, 300, 350),

rfeControl = rfeControl(functions = rfFuncs , method="cv"))

model.1.prediction <- predict(model.1$fit, hiv.692) Winning Algorithm library(RandomForest)

fin.train.1e.5050 <- randomForest(as.factor(Resp) ~ VL.t0 + rt184 + CD4.t0 + rt215 +

rt98 + rt225 + rt35 + rt151 + rt207 + rt227 + pr12 + pr70 +

rt101 + pr71 + rt69 + rt75 + rt187 + rt203 + rt41 + pr60 +

pr20 + rt43 + rt179 + pr82 + pr46 + pr90 + rt31 + pr30 +

pr10 + rt44 + pr58 + rt121 + pr73 + rt195 + pr63 + rt109 +

pr54 + rt173 + pr88 + rt205 + rt190 + rt77 + pr13 + rt115 +

rt106 + pr32 + pr14 + rt170 + rt4 + pr92 + pr35,

na.action=na.omit, data=yellow.5050, mtry=15) The Problem Single Hidden Layer NN Netflix Prize contest announced. October 8, 2007 Cinematch performance surpassed. October 15, 2007 1% improvement reached. November 13, 2007 BellKor awarded $50,000 progress prize 8.43% improvement

Description of algorithm published December 10, 2008 BellKor in Big Chaos awarded $50,000 progress prize

9.44% improvement

Description of algorithm published June 26, 2009 BellKor's Pragmatic Chaos reaches 10.05% improvement.

"Last call" for final submissions. July 27, 2009 BellKor's Pragmatic Chaos wins grand prize

($1,000,000) September 21, 2009 Accuracy, error rate, true positive rate, false positive rate, true negative rate, false negative rate, sensitivity, specificity, recall, positive predictive value, negative predictive value, precision, fallout, miss, phi correlation coefficient, Matthews correlation coefficient, mutual information, chi square statistic, odds ratio, lift value, precision/recall F measure, ROC convex hull, area under the ROC curve, precision/recall break-even point, calibration error, mean cross-entropy, root mean squared error, SAR measure, expected cost, explicit cost. http://rocr.bioinf.mpi-sb.mpg.de/ ROC curves, precision/recall plots, lift charts, cost curves, custom curves by freely selecting one performance measure for the x axis and one for the y axis, handling of data from cross-validation or bootstrapping, curve averaging (vertically, horizontally, or by threshold), standard error bars, box plots, curves that are color-coded by cutoff, printing threshold values on the curve, tight integration with Rs plotting facilities (making it easy to adjust plots or to combine multiple plots), fully customizable, easy to use (only 3 commands). Supported measures Features Check out the MachineLearning Task View on CRAN ElemStatLearn

gafit

GAMBoost

gbm

glmnet

glmpath

grplasso

hda

igraph

ipred

kernlab

klaR party

penalized

penalizedSVM

predbayescor

quantregForest

randomForest

randomSurvivalForest

rda

rdetools

REEMtree

relaxo

rgenoud lars

lasso2

lda

LiblineaR

LogicForest

LogicReg

ltr

mboost

mvpart

ncvreg

nnet

pamr rgp

rminer

ROCR

rpart

RSNNS

RWeka

sda

SDDA

svmpath

tgp

tree

TWIX

varSelRF http://cran.r-project.org/web/views/MachineLearning.html http://cran.r-project.org/web/views/HighPerformanceComputing.html Max Lin's 2nd place solution: Ensemble Learning with 4 classifiers Source: https://github.com/m4xl1n/r_recommendation_system Take training data and create an out-of-sample test set 1 Training Data (80%) Test (10%) Test (10%) 2 Train classifiers with training data 3 Use classifiers to predict test data 4 Fit ensemble model using logistic regression on test data (ROCR) x 10-folds 10 =0.9833 logistic regression (stats) latent factor model (ltr) latent Dirichlet allocation on topic (lda) latent Dirichlet allocation on task view (lda) logistic regression

(stats) + intuition logistic regression (stats) latent factor model (ltr) latent Dirichlet allocation on topic (lda) latent Dirichlet allocation on task view (lda) AUC Hidden Layer Input Layer Output Layer The Problem Predict stock price movements at 5-minute intervals

Useful for high-frequency traders

Performance measured by AUC The Data Time series with 609 blind explanatory variables for an unknown stock stock prices

sectoral data

economic data

experts' predictions

indexes Training set: 5922 observations taken at 5-minute intervals.

Test set: 2539 observations taken at 5-minute intervals. V74CA0 V74HA1 V74CD12 V74LA13 V88CA1 V88H12A12 Logisitic Regression Feature selection done to select top two significant variables at lag 12

Further significance noticed at lag 1

2 variables * 4 prices * 4 lags = 32 covariates V74 and V88

Each variable has 4 prices: Open, High, Low, Last

Take lag 0, 1, 12, 13 Model Selection Manual model selection results in logistic regression model of the form... Target ~ V74LASTAdv0

+V74HIGHAdv1

+V74LASTAdv12

+V74LASTAdv13

+V88LASTAdv1

+V88HIGHAdv12 = 0.9908 Artificial Neural Network nnet package to fit a feed-forward neural network with a single hidden layer

Using the covariates from the previous logisitic regression model, we get... AUC = 0.9898 AUC Missing Data Many features had missing data

Solutions? Impute it

impute package (i.e. knn.impute)

Ignore it

Remove it I liked , will I like ? Cole Harris (Team DejaVu)'s Winning Solution True Crowdsourcing Predict the likelihood that an HIV patient's infection will become less severe, given a small dataset and limited clinical information. Chris Raimondi caret package for recursive feature elimination - rfe( )

RandomForest to fit model by Arvind Narayanan, Elaine Shi, Ben Rubinstein, and Yong J Kil Full description: http://blog.kaggle.com/2011/01/15/how-we-did-it-the-winners-of-the-ijcnn-social-network-challenge/ Predicting tourism related time series Miscellaneous

### Present Remotely

Send the link below via email or IM

CopyPresent to your audience

Start remote presentation- Invited audience members
**will follow you**as you navigate and present - People invited to a presentation
**do not need a Prezi account** - This link expires
**10 minutes**after you close the presentation - A maximum of
**30 users**can follow your presentation - Learn more about this feature in our knowledge base article

# Using R for Data Mining Competitions

An GTA R Users Group talk

by

Tweet