  • PackageName The name of the current R package
  • DependencyCount Number of other packages that have it as a dependency
  • SuggestionCount Number of other packages that suggest it
  • ViewsIncluding Number of views that include it
  • CorePackage Whether or not it is a core package (binary)
  • RecommendedPackage Whether or not it is a recommended package (binary)
  • Maintainer Name and e-mail of package maintainer
  • PackagesMaintaining How many other packages the maintainer maintains

## Fit the model to the training data
## (the Log* covariates are log-transformed versions of the counts above)

logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User),
                 data = train.data,
                 family = binomial(link = 'logit'))

## Use model to predict on the test set

logit.predict <- predict.glm(logit.fit, newdata = test.data, type = 'response')

## Fit model on training data
## ((...)^2 expands to all main effects plus two-way interactions;
##  '-' drops the listed terms from the formula)

logit.fit <- glm(Installed ~ (LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   ViewsIncluding)^2 +
                   factor(User) -
                   LogSuggestionCount:ViewsIncluding -
                   LogImportCount:ViewsIncluding,
                 data = train.data,
                 family = binomial(link = 'logit'))

#### Predicting the test data

logit.predict <- predict.glm(logit.fit, newdata=test.data, type='response')

# Preallocate the result vector
mutual <- numeric(test.size)

for (i in 1:test.size) {
  a <- test$V1[i]
  b <- test$V2[i]
  # Get all nodes that connect to a
  a.list <- c(train$V1[which(train$V2 == a)], train$V2[which(train$V1 == a)])
  # Get all nodes that connect to b
  b.list <- c(train$V1[which(train$V2 == b)], train$V2[which(train$V1 == b)])
  # Find number of mutual nodes
  mutual[i] <- sum(a.list %in% b.list)
}

library(igraph)

edges <- matrix(c(train$V1, train$V2), ncol = 2, byrow = FALSE)
sngraph <- graph.edgelist(edges, directed = TRUE)

# Length (in edges) of the shortest path between v1 and v2
get.path.lengths <- function(v1, v2, graph)
  length(get.shortest.paths(graph, v1, v2, mode = "all")[[1]]) - 1

path.lengths <- mapply(get.path.lengths, test$V1, test$V2, MoreArgs = list(graph = sngraph))

Why Data Mining Competitions?

Why R?

So Where Are They?

An incomplete list...

Data Mining

Competitions

Example: Netflix

Big Data

Netflix data (a matrix of about 480,000 members' ratings for about 18,000 movies) was about 65 GB!

The Matrix package's sparseMatrix() function stores the data efficiently in ~800 MB

The irlba package can be used to do SVD and PCA efficiently for large sparse matrices
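A minimal sketch of this pattern, with toy dimensions standing in for the full ratings matrix:

library(Matrix)
library(irlba)

# Toy triplet data: member index, movie index, rating
i <- c(1, 2, 2, 3, 4, 5, 6, 6)
j <- c(2, 1, 3, 4, 2, 5, 1, 4)
x <- c(5, 3, 4, 2, 1, 5, 4, 3)

# Only the non-zero entries are stored
ratings <- sparseMatrix(i = i, j = j, x = x)

# Truncated SVD: compute just the leading 2 singular vectors
decomp <- irlba(ratings, nv = 2)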

Individual

Competitions

There's a package for that

Parallel R

Revolution R

  • Speed up computation
  • Bypass memory limitations
  • Quite easy for independent simulations (see the sketch below)
  • Not so simple for non-independent simulations
  • Rmpi
  • snow
  • multicore
  • parallel (default in version 2.14)
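For the independent case, a minimal sketch with the parallel package (the toy simulation and core count are illustrative):

library(parallel)

# One independent replicate: the mean of a large random sample
one.sim <- function(i) mean(rnorm(1e6))

# Run 100 replicates across 4 cores (mclapply forks on Unix-alikes;
# on Windows, use makeCluster() and parLapply() instead)
results <- mclapply(1:100, one.sim, mc.cores = 4)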
Commercial software built on top of R:

  • Optimized for multi-processor workstations
  • Multi-threaded math libraries
  • Terabyte file structures and optimized algorithms for "Big Data"
  • Free for academic users

Competition

Platforms

More info: http://www.revolutionanalytics.com/products/enterprise-big-data.php

http://cran.r-project.org/web/views/HighPerformanceComputing.html

Check out the MachineLearning Task View on CRAN

Machine Learning

ahaz, arules, BayesTree, Boruta, BPHO, caret, class, CORElearn, CoxBoost, Cubist, e1071, earth, elasticnet

Benefits

Model Evaluation

ElemStatLearn, gafit, GAMBoost, gbm, glmnet, glmpath, grplasso, hda, igraph, ipred, kernlab, klaR

http://cran.r-project.org/web/views/MachineLearning.html

  • Most packages have built in model evaluation tools
  • ROCR: powerful and easy-to-use package for visualizing classifier performance

lars, lasso2, lda, LiblineaR, LogicForest, LogicReg, ltr, mboost, mvpart, ncvreg, nnet, pamr

Supported measures

Accuracy, error rate, true positive rate, false positive rate, true negative rate, false negative rate, sensitivity, specificity, recall, positive predictive value, negative predictive value, precision, fallout, miss, phi correlation coefficient, Matthews correlation coefficient, mutual information, chi square statistic, odds ratio, lift value, precision/recall F measure, ROC convex hull, area under the ROC curve, precision/recall break-even point, calibration error, mean cross-entropy, root mean squared error, SAR measure, expected cost, explicit cost.

party, penalized, penalizedSVM, predbayescor, quantregForest, randomForest, randomSurvivalForest, rda, rdetools, REEMtree, relaxo, rgenoud

Features

rgp, rminer, ROCR, rpart, RSNNS, RWeka, sda, SDDA, svmpath, tgp, tree, TWIX, varSelRF

ROC curves, precision/recall plots, lift charts, cost curves, custom curves by freely selecting one performance measure for the x axis and one for the y axis, handling of data from cross-validation or bootstrapping, curve averaging (vertically, horizontally, or by threshold), standard error bars, box plots, curves that are color-coded by cutoff, printing threshold values on the curve, tight integration with R's plotting facilities (making it easy to adjust plots or to combine multiple plots), fully customizable, easy to use (only 3 commands).

http://rocr.bioinf.mpi-sb.mpg.de/
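A minimal ROCR sketch (the probability and label vectors here are illustrative random data):

library(ROCR)

# Predicted probabilities and true 0/1 labels
pred <- prediction(predictions = runif(100), labels = rbinom(100, 1, 0.5))

# ROC curve, color-coded by classification cutoff
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)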

Great for Companies

  • Low-cost way to get innovative solutions
  • Crowd sourcing
  • Ownership of solution

lattice

Some examples from R graph gallery: http://addictedtor.free.fr/graphiques/

by Deepayan Sarkar

Data Visualization

Great for You

require(lattice)

# Density histogram of singer heights by voice part, with a fitted
# normal curve overlaid in each panel
histogram(~ height | voice.part, data = singer,
          xlab = "Height (inches)", type = "density",
          panel = function(x, ...) {
            panel.histogram(x, ...)
            panel.mathdensity(dmath = dnorm, col = "black",
                              args = list(mean = mean(x), sd = sd(x)))
          })

ggplot2

The Usual Suspects

rgl

by Hadley Wickham

by Daniel Adler and Duncan Murdoch

require(ggplot2)

# Density of diamond depth, filled by cut quality
ggplot(diamonds, aes(depth, fill = cut)) +
  geom_density(alpha = 0.2) +
  xlim(55, 70)

A Bit of History

anaglyph

RnavGraph

by Jonathan Lee

require(anaglyph)

# 3D anaglyph scatterplot of the mtcars data
attach(mtcars)
anaglyph.plot(x = wt, y = mpg, z = disp,
              main = "Weight vs Miles Per Gallon vs Displacement",
              xlab = "Weight (lb/1000lb)",
              ylab = "Miles/US Gallon")

by Adrian Waddell

  • Ready-to-use data
  • Structured goal
  • No accountability, no stress
  • Recognition and prizes ($$)
  • Publication opportunity
  • Fun!

wordcloud

by Ian Fellows

require(tm)
require(wordcloud)

# Clean the crude-oil news corpus: drop punctuation and stopwords
data(crude)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, function(x) removeWords(x, stopwords()))

# Word frequencies from the term-document matrix
tdm <- TermDocumentMatrix(crude)
v <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

wordcloud(d$word, d$freq)

  • open source
  • cross-platform
  • powerful
  • de-facto standard

The Prize

Not the first data mining competition, but probably the most well-known.

The Journey

  • October 2, 2006: Netflix Prize contest announced.
  • October 8, 2006: Cinematch performance surpassed.
  • October 15, 2006: 1% improvement reached.
  • November 13, 2007: BellKor awarded $50,000 progress prize for 8.43% improvement; description of algorithm published.
  • December 10, 2008: BellKor in BigChaos awarded $50,000 progress prize for 9.44% improvement; description of algorithm published.
  • June 26, 2009: BellKor's Pragmatic Chaos reaches 10.05% improvement.
  • July 27, 2009: "Last call" for final submissions.
  • September 21, 2009: BellKor's Pragmatic Chaos wins grand prize ($1,000,000).

The Players

  • 51,051 contestants on 41,305 teams from 186 countries.
  • Anyone could join: computer scientists, statisticians, programmers, etc.

The Data

  • At the end of the competition, models with at least a 10% improvement are tested on the test set.
  • The best-performing model on the test set is declared the winner.

Test set (~1 million entries)

Quiz set (~1 million entries)

  • Model tested on quiz set to determine leaderboard ranking
  • Competition deadline set once 10% improvement is reached

The Goal

  • All data, including answers, are provided
  • Models are fit to this data

Training Set

(~100 million entries)

  • Do better than Netflix's existing system, Cinematch, by at least 10%
  • Cinematch uses simple linear models with data conditioning

  • Grand prize: $1 million USD

The Problem

  • Netflix is an online movie rental service
  • Can we predict a Netflix user's rating for a movie given previous movie rating data?

I liked [this movie], will I like [that movie]?

Isn't (insert language here) Faster?

Run-time? Perhaps.

Development time? Probably not!

Packages to note: compiler, Rcpp
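A minimal sketch of the Rcpp route (cppFunction() is from later Rcpp releases than this talk; the function shown is illustrative):

library(Rcpp)

# Compile a small C++ loop and expose it to R
cppFunction('
double sumSquares(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) total += x[i] * x[i];
  return total;
}')

sumSquares(c(1, 2, 3))  # 14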

Using R for

Data Mining Competitions

Jonathan Lee, University of Western Ontario

GTA R Users Group

November 25, 2011

Ryerson University


Additional Resources

Case Studies

Data Mining is a Team Sport

How can we collaborate?

Code hosting (with version control)

Github http://www.github.com

SourceForge http://www.sourceforge.net

R-forge http://r-forge.r-project.org

RGoogleDocs package

Data Mining and

Machine Learning

  • The Elements of Statistical Learning

http://www-stat.stanford.edu/~tibs/ElemStatLearn/

  • Stanford's Machine Learning Course

http://www.ml-class.org

R Package Recommendation Engine

A better model

  • Add in automatic stepwise model selection... (see the sketch below)
  • AUC on test set: 0.9453

An even better model

  • If a package is installed, assume all dependencies and imports are installed
  • Assume all core packages are installed
  • Synthesize better predictor variables
  • Add a little bit of intuition
  • AUC: 0.9682
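A minimal sketch of the stepwise selection step, assuming the logit.fit from earlier:

# AIC-based stepwise selection over the fitted logistic regression
logit.step <- step(logit.fit, direction = "both")
summary(logit.step)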

A winning model

Max Lin's 2nd place solution: ensemble learning with 4 classifiers + intuition. AUC = 0.9833.

Source: https://github.com/m4xl1n/r_recommendation_system

Repeated over 10 folds:

  1. Take the training data (80%) and create an out-of-sample test set (10%).
  2. Train the classifiers with the training data: logistic regression (stats), latent factor model (ltr), latent Dirichlet allocation on topic (lda), latent Dirichlet allocation on task view (lda).
  3. Use the classifiers to predict the test data.
  4. Fit the ensemble model using logistic regression on the test data (ROCR).

A simple model

  • Logistic regression (stats package) on predictor variables with some variable transformation
  • AUC on test set: 0.8178

The Heritage Health Prize

The Problem

  • There are so many packages available for R
  • Impossible to know all of them -- so which ones should one invest time in learning?
  • Cinematch for R

The Data

The Prize

  • Voluntary sample of 52 R users
  • ~100,000 rows of data for 1,865 packages
  • Each row contains metadata of a package and whether or not a user has that package installed.

AUC: Area Under the (ROC) Curve

                             Actual
                   Positive          Negative
Predicted
  Positive     True Positive     False Positive
  Negative     False Negative    True Negative

  • For a binary classification problem, we can get a confusion matrix.
  • Receiver Operating Characteristic (ROC) space is defined as the true positive rate vs the false positive rate.
  • Our predicted values are not binary, but rather a probability between 0 and 1.
  • By evaluating the ROC point at every classification threshold, we get an ROC curve.
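A minimal sketch, assuming the predicted probabilities logit.predict and the 0/1 labels test.data$Installed from the earlier code:

library(ROCR)

pred <- prediction(logit.predict, test.data$Installed)

# Area under the ROC curve
auc <- performance(pred, "auc")@y.values[[1]]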

Test set (1/3)

  • Any 3 books from the UseR! book list (postage included)
  • Winner is determined by highest AUC value on test set

Training set (2/3)

  • Predict the number of days a patient will spend in the hospital in the next year given historical claims data.

  • $3M grand prize + $240K in progress prizes
  • Ends April 2013

HIV Viral Load Prediction

True Crowdsourcing

Chris Raimondi

Winning Algorithm

Artificial Neural Network

The Problem

  • nnet package to fit a feed-forward neural network with a single hidden layer
  • Using the covariates from the previous logistic regression model, we get AUC = 0.9908
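A minimal sketch of this step (the data frame, covariate names, and tuning values are illustrative):

library(nnet)

# Single-hidden-layer feed-forward network; size = number of hidden units
nn.fit <- nnet(as.factor(Resp) ~ VL.t0 + CD4.t0 + rt184 + rt215,
               data = train.data, size = 5, decay = 0.01, maxit = 500)

# Predicted probabilities on held-out data
nn.predict <- predict(nn.fit, newdata = test.data)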

library(caret)

# Recursive feature elimination with random-forest scoring,
# evaluated by cross-validation
model.1 <- rfe(hiv.412[, c(5:617)], as.factor(hiv.412[, 618]),
               sizes = c(1:10, 15, 30, 60, 90, 120, 150, 200, 250, 300, 350),
               rfeControl = rfeControl(functions = rfFuncs, method = "cv"))

model.1.prediction <- predict(model.1$fit, hiv.692)

  • caret package for recursive feature elimination - rfe( )
  • randomForest to fit the model

[Diagram: single-hidden-layer neural network -- input layer (V74CA0, V74HA1, V74CD12, V74LA13, V88CA1, V88H12A12), hidden layer, output layer]

Predict the likelihood that an HIV patient's infection will become less severe, given a small dataset and limited clinical information.

library(randomForest)

fin.train.1e.5050 <- randomForest(as.factor(Resp) ~ VL.t0 + rt184 + CD4.t0 + rt215 +
                                    rt98 + rt225 + rt35 + rt151 + rt207 + rt227 + pr12 + pr70 +
                                    rt101 + pr71 + rt69 + rt75 + rt187 + rt203 + rt41 + pr60 +
                                    pr20 + rt43 + rt179 + pr82 + pr46 + pr90 + rt31 + pr30 +
                                    pr10 + rt44 + pr58 + rt121 + pr73 + rt195 + pr63 + rt109 +
                                    pr54 + rt173 + pr88 + rt205 + rt190 + rt77 + pr13 + rt115 +
                                    rt106 + pr32 + pr14 + rt170 + rt4 + pr92 + pr35,
                                  na.action = na.omit, data = yellow.5050, mtry = 15)

INFORMS 2011

Model Selection

AUC = 0.9898

  • Manual model selection results in a logistic regression model of the form...

Cole Harris (Team DejaVu)'s Winning Solution

Target ~ V74LASTAdv0 + V74HIGHAdv1 + V74LASTAdv12 +
         V74LASTAdv13 + V88LASTAdv1 + V88HIGHAdv12

Miscellaneous

Logistic Regression

  • V74 and V88
  • Each variable has 4 prices: Open, High, Low, Last
  • Take lags 0, 1, 12, 13 (see the sketch below)
  • Feature selection done to select the top two significant variables at lag 12
  • Further significance noticed at lag 1
  • 2 variables x 4 prices x 4 lags = 32 covariates
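A minimal sketch of building the lagged covariates (the helper make.lags and the series v74.last are illustrative):

# Shift a series back by k steps, padding the front with NAs
make.lags <- function(x, lags)
  sapply(lags, function(k) c(rep(NA, k), head(x, length(x) - k)))

lags <- c(0, 1, 12, 13)
V74LAST.lags <- make.lags(v74.last, lags)
colnames(V74LAST.lags) <- paste0("V74LASTAdv", lags)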

Missing Data

  • Many features had missing data
  • Solutions?
  • Impute it

impute package (e.g., impute.knn)

  • Ignore it
  • Remove it
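A minimal sketch of the imputation route, on a toy matrix (impute.knn is the Bioconductor impute package's k-nearest-neighbour imputer):

library(impute)

# Toy matrix with a few missing entries
x <- matrix(rnorm(200), nrow = 20)
x[sample(200, 5)] <- NA

# impute.knn returns a list; $data holds the completed matrix
x.imputed <- impute.knn(x, k = 5)$data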

The Problem

  • Predict stock price movements at 5-minute intervals
  • Useful for high-frequency traders
  • Performance measured by AUC

The Data

  • Time series with 609 blind explanatory variables for an unknown stock

Training set: 5922 observations taken at 5-minute intervals.

Test set: 2539 observations taken at 5-minute intervals.

  • stock prices
  • sectoral data
  • economic data
  • experts' predictions
  • indexes
  • Prezi (this presentation software)

http://prezi.com

  • R-bloggers

http://www.r-bloggers.com

  • Kaggle blog: No Free Hunch

http://blog.kaggle.com/

Tourism Forecasting

Predicting tourism-related time series

The Data

  • Inbound nodes: 1,133,518
  • Outbound nodes: 29
  • Inbound and outbound nodes: 37,689

  • 1,133,547 nodes (users)

  • 7,246,943 edges (contacts)

  • No additional covariates given

  • What social network generates this type of graph?
  • Directed graph

  • 0.1% used as test set

IJCNN Social Network Challenge


Two Degrees of Separation

  • Number of mutual friends

  • p=1 if within two degrees of separation
  • p=0 otherwise
  • AUC on test set: 0.7529

by Arvind Narayanan, Elaine Shi, Ben Rubinstein, and Yong J Kil

A winning model

Going Beyond the Data

  • Realized where the dataset came from
  • De-anonymized the users (not an easy task!)
  • Recreated the missing edges (60%)
  • Used randomForest on standard social network features to fill in the rest

  • So why not 100% accuracy?
  • Social networks evolve over time.

The Problem

  • AUC of 0.9812!
  • How did they do it?

Full description: http://blog.kaggle.com/2011/01/15/how-we-did-it-the-winners-of-the-ijcnn-social-network-challenge/

  • Predicting edges in an online social network
  • If you and I are both connected to Bob, what is the probability (p) that I am connected to you?

n Degrees of Separation

  • Can't feasibly find 3+ degrees of separation
  • Let's put this into a graph (igraph package) and find the degrees of separation

  • Inverse function of degrees of separation?

  • AUC on test set: 0.832

The Prize

  • Free registration for two to the IJCNN 2011 Conference and chance to present your solution. ($950 value)
  • Winner is determined by highest AUC value on test set

More covariates

  • Number of friends
  • Shortest path
  • Number of shortest paths, etc.
  • Having a graph opens up a lot of possible covariates (see the sketch below)

  • After logistic regression...
  • AUC on test set: 0.8704
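A minimal sketch of two such covariates, assuming the sngraph built earlier (and the same older igraph API used above):

# Number of friends: total degree of each test node
friend.count <- degree(sngraph, test$V1, mode = "all")

# Shortest-path length between the first test pair
sp <- shortest.paths(sngraph, v = test$V1[1], to = test$V2[1], mode = "all")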

Jonathan Lee

E-mail: jlee253@uwo.ca

Website: http://www.stats.uwo.ca/gradwebs/jlee/

This presentation and slides will be made available on my blog: http://www.compmath.com
