## Fit the model to the training data
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User),
                 data = train.data,
                 family = binomial(link = 'logit'))
## Use the model to predict on the test set
logit.predict <- predict(logit.fit, newdata = test.data, type = 'response')
## Fit a model on the training data with all two-way interactions among
## the four predictors, minus two of the interaction terms
logit.fit <- glm(Installed ~ (LogDependencyCount +
                                LogSuggestionCount +
                                LogImportCount +
                                ViewsIncluding)^2 +
                   factor(User) -
                   LogSuggestionCount:ViewsIncluding -
                   LogImportCount:ViewsIncluding,
                 data = train.data,
                 family = binomial(link = 'logit'))
#### Predicting on the test data
logit.predict <- predict(logit.fit, newdata = test.data, type = 'response')
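As a quick check of fit quality (a minimal sketch, assuming test.data also carries the observed Installed column), we can threshold the predicted probabilities and compute the test accuracy:

# Convert predicted probabilities into 0/1 class labels at a 0.5 cutoff
pred.class <- as.numeric(logit.predict > 0.5)
# Proportion of test observations classified correctly
mean(pred.class == test.data$Installed)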
# Count the mutual neighbours of each test pair of nodes
mutual <- numeric(test.size)
for (i in 1:test.size) {
  a <- test$V1[i]
  b <- test$V2[i]
  # Get all nodes that connect to a
  a.list <- c(train$V1[which(train$V2 == a)], train$V2[which(train$V1 == a)])
  # Get all nodes that connect to b
  b.list <- c(train$V1[which(train$V2 == b)], train$V2[which(train$V1 == b)])
  # Find the number of mutual nodes
  mutual[i] <- sum(a.list %in% b.list)
}
library(igraph)
# Build a directed graph from the training edge list
edges <- matrix(c(train$V1, train$V2), ncol = 2, byrow = FALSE)
sngraph <- graph.edgelist(edges, directed = TRUE)
# Path length in edges = number of vertices on the shortest path - 1
# (fixed to use the graph argument rather than the global sngraph)
get.path.lengths <- function(v1, v2, graph)
  length(get.shortest.paths(graph, v1, v2, mode = "all")[[1]]) - 1
path.lengths <- mapply(get.path.lengths, test$V1, test$V2,
                       MoreArgs = list(graph = sngraph))
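An equivalent sketch using igraph's distance routine directly (shortest.paths in the igraph of this vintage; the same function is called distances in current igraph), which returns Inf rather than -1 for unreachable pairs:

# Shortest-path distance for each test pair, ignoring edge direction
path.lengths2 <- mapply(function(v1, v2)
                          shortest.paths(sngraph, v1, v2, mode = "all"),
                        test$V1, test$V2)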
The Netflix data (a matrix of about 480,000 members' ratings for about 18,000 movies) was about 65 GB stored densely!
The Matrix package's sparseMatrix() function stores the same data efficiently in roughly 800 MB.
The irlba package can then be used to compute SVD and PCA efficiently for large sparse matrices.
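A minimal sketch of the idea, with a made-up toy matrix standing in for the actual Netflix data:

library(Matrix)
library(irlba)
# Store ratings in triplet form: member i rated movie j with value x
i <- c(1, 2, 2, 3)
j <- c(1, 1, 3, 2)
x <- c(5, 3, 4, 1)
ratings <- sparseMatrix(i = i, j = j, x = x, dims = c(3, 3))
# irlba computes only the leading singular vectors, so it scales to
# sparse matrices far too large for a dense svd()
decomp <- irlba(ratings, nv = 2)
decomp$d   # the two leading singular values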
Revolution Analytics' enterprise product:
built on top of R
runs on workstations
optimized algorithms for "Big Data"
More info: http://www.revolutionanalytics.com/products/enterprise-big-data.php
See also the CRAN High-Performance Computing task view: http://cran.r-project.org/web/views/HighPerformanceComputing.html
Check out the MachineLearning Task View on CRAN: http://cran.r-project.org/web/views/MachineLearning.html
ahaz
arules
BayesTree
Boruta
BPHO
caret
class
CORElearn
CoxBoost
Cubist
e1071
earth
elasticnet
ElemStatLearn
gafit
GAMBoost
gbm
glmnet
glmpath
grplasso
hda
igraph
ipred
kernlab
klaR
lars
lasso2
lda
LiblineaR
LogicForest
LogicReg
ltr
mboost
mvpart
ncvreg
nnet
pamr
party
penalized
penalizedSVM
predbayescor
quantregForest
randomForest
randomSurvivalForest
rda
rdetools
REEMtree
relaxo
rgenoud
rgp
rminer
ROCR
rpart
RSNNS
RWeka
sda
SDDA
svmpath
tgp
tree
TWIX
varSelRF
ROCR
Supported measures: accuracy, error rate, true positive rate, false positive rate, true negative rate, false negative rate, sensitivity, specificity, recall, positive predictive value, negative predictive value, precision, fallout, miss, phi correlation coefficient, Matthews correlation coefficient, mutual information, chi square statistic, odds ratio, lift value, precision/recall F measure, ROC convex hull, area under the ROC curve, precision/recall break-even point, calibration error, mean cross-entropy, root mean squared error, SAR measure, expected cost, explicit cost.
Features: ROC curves, precision/recall plots, lift charts, cost curves, custom curves by freely selecting one performance measure for the x axis and one for the y axis, handling of data from cross-validation or bootstrapping, curve averaging (vertically, horizontally, or by threshold), standard error bars, box plots, curves that are color-coded by cutoff, printing threshold values on the curve, tight integration with R's plotting facilities (making it easy to adjust plots or to combine multiple plots), fully customizable, easy to use (only 3 commands).
http://rocr.bioinf.mpi-sb.mpg.de/
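A minimal sketch of those three commands, using made-up scores and labels:

library(ROCR)
scores <- c(0.9, 0.8, 0.4, 0.3)           # hypothetical classifier scores
labels <- c(1, 1, 0, 1)                   # hypothetical true classes
pred <- prediction(scores, labels)        # 1. pair predictions with labels
perf <- performance(pred, "tpr", "fpr")   # 2. compute the ROC curve
plot(perf)                                # 3. plot it
performance(pred, "auc")@y.values[[1]]    # and the area under it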
Some examples from the R graph gallery: http://addictedtor.free.fr/graphiques/
lattice (by Deepayan Sarkar)
require(lattice)
# Density histograms of singer heights by voice part, with a fitted
# normal density overlaid in each panel
histogram(~ height | voice.part, data = singer,
          xlab = "Height (inches)", type = "density",
          panel = function(x, ...) {
            panel.histogram(x, ...)
            panel.mathdensity(dmath = dnorm, col = "black",
                              args = list(mean = mean(x), sd = sd(x)))
          })
rgl (by Daniel Adler and Duncan Murdoch)
ggplot2 (by Hadley Wickham)
require(ggplot2)
# Overlapping density plots of diamond depth, one per cut quality;
# the `+` goes at the end of each line so the expression continues
ggplot(diamonds, aes(depth, fill = cut)) +
  geom_density(alpha = 0.2) +
  xlim(55, 70)
anaglyph (by Jonathan Lee)
require(anaglyph)
attach(mtcars)
# 3-D red/cyan anaglyph scatterplot of weight, fuel economy, and displacement
anaglyph.plot(x = wt, y = mpg, z = disp,
              main = "Weight vs Miles Per Gallon vs Displacement",
              xlab = "Weight (lb/1000lb)",
              ylab = "Miles/US Gallon")
RnavGraph (by Adrian Waddell)
wordcloud (by Ian Fellows)
require(tm)
require(wordcloud)
data(crude)
# Clean the corpus: strip punctuation and common stopwords
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, function(x) removeWords(x, stopwords()))
# Tabulate term frequencies across the documents
tdm <- TermDocumentMatrix(crude)
v <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq)
Not the first data mining competition, but probably the most well-known.
Timeline:
Netflix Prize contest announced.
Cinematch performance surpassed; 1% improvement reached.
BellKor awarded $50,000 progress prize (8.43% improvement); description of algorithm published.
BellKor in BigChaos awarded $50,000 progress prize (9.44% improvement); description of algorithm published.
BellKor's Pragmatic Chaos reaches 10.05% improvement; "last call" for final submissions.
BellKor's Pragmatic Chaos wins grand prize ($1,000,000).
The data:
Training set (~100 million entries)
Quiz set (~1 million entries)
Test set (~1 million entries)
The task: "I liked ____, will I like ____?"
Run-time? Perhaps.
Development time? Probably not!
Packages to note: compiler, Rcpp
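A minimal sketch of the compiler package in action (the function here is a made-up example):

library(compiler)
# An interpreted function with an explicit loop
slow.sum <- function(x) {
  s <- 0
  for (v in x) s <- s + v
  s
}
# cmpfun() returns a byte-compiled version of the same function
fast.sum <- cmpfun(slow.sum)
fast.sum(1:1e6)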
How can we collaborate?
Code hosting (with version control)
GitHub http://www.github.com
SourceForge http://www.sourceforge.net
R-forge http://r-forge.r-project.org
RGoogleDocs package
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
http://www.ml-class.org
Evaluation: AUC (area under the ROC curve), computed from the confusion matrix counts: true positives, false positives, true negatives, false negatives.
Competition data split: training set (2/3), test set (1/3).
Max Lin's 2nd place solution: ensemble learning with 4 classifiers + intuition. AUC = 0.9833.
Source: https://github.com/m4xl1n/r_recommendation_system
1. Take the training data and create an out-of-sample test set: Training Data (80%), Test (10%), x 10-folds.
2. Train the classifiers with the training data: logistic regression (stats), latent factor model (ltr), latent Dirichlet allocation on topic (lda), latent Dirichlet allocation on task view (lda).
3. Use the classifiers to predict the test data.
4. Fit the ensemble model using logistic regression on the test data (ROCR).
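A minimal sketch of the stacking step (step 4), with hypothetical hold-out predictions p1..p4 and hypothetical labels y standing in for the real data:

# Hypothetical out-of-sample predictions from the four classifiers
p1 <- runif(100); p2 <- runif(100); p3 <- runif(100); p4 <- runif(100)
y <- rbinom(100, 1, 0.5)   # hypothetical true labels
# The ensemble is a logistic regression on the classifiers' outputs
ensemble.fit <- glm(y ~ p1 + p2 + p3 + p4, family = binomial)
ensemble.pred <- predict(ensemble.fit, type = "response")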
Chris Raimondi's solution: AUC = 0.9908.
library(caret)
# Recursive feature elimination with random forests, using cross-validation
# to choose among the candidate subset sizes
model.1 <- rfe(hiv.412[, c(5:617)], as.factor(hiv.412[, 618]),
               sizes = c(1:10, 15, 30, 60, 90, 120, 150, 200, 250, 300, 350),
               rfeControl = rfeControl(functions = rfFuncs, method = "cv"))
model.1.prediction <- predict(model.1$fit, hiv.692)
[Neural network diagram: an input layer of predictor variables (V74CA0, V74HA1, V74CD12, V74LA13, V88CA1, V88H12A12), a hidden layer, and an output layer.]
Predict the likelihood that an HIV patient's infection will become less severe, given a small dataset and limited clinical information.
library(randomForest)
# Random forest on the selected viral load, CD4 count, and mutation-site predictors
fin.train.1e.5050 <- randomForest(as.factor(Resp) ~ VL.t0 + rt184 + CD4.t0 + rt215 +
                                    rt98 + rt225 + rt35 + rt151 + rt207 + rt227 + pr12 + pr70 +
                                    rt101 + pr71 + rt69 + rt75 + rt187 + rt203 + rt41 + pr60 +
                                    pr20 + rt43 + rt179 + pr82 + pr46 + pr90 + rt31 + pr30 +
                                    pr10 + rt44 + pr58 + rt121 + pr73 + rt195 + pr63 + rt109 +
                                    pr54 + rt173 + pr88 + rt205 + rt190 + rt77 + pr13 + rt115 +
                                    rt106 + pr32 + pr14 + rt170 + rt4 + pr92 + pr35,
                                  na.action = na.omit, data = yellow.5050, mtry = 15)
Cole Harris (Team DejaVu)'s winning solution: AUC = 0.9898.
Target ~ V74LASTAdv0 + V74HIGHAdv1 + V74LASTAdv12 + V74LASTAdv13 +
  V88LASTAdv1 + V88HIGHAdv12
impute package (e.g. the impute.knn function for k-nearest-neighbour imputation)
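A minimal sketch of k-nearest-neighbour imputation with the impute package, using a made-up matrix rather than the competition data:

library(impute)
# A small matrix with some values knocked out
m <- matrix(rnorm(200), nrow = 20)
m[sample(200, 20)] <- NA
# impute.knn() fills each missing entry from the nearest rows (k = 10 by default)
imputed <- impute.knn(m)$data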
Training set: 5922 observations taken at 5-minute intervals.
Test set: 2539 observations taken at 5-minute intervals.
http://prezi.com
http://www.r-bloggers.com
http://blog.kaggle.com/
Predicting tourism related time series
IJCNN Social Network Challenge graph:
Outbound nodes: 29
Inbound and Outbound nodes: 37,689
Inbound nodes: 1,133,518
Winning solution by Arvind Narayanan, Elaine Shi, Ben Rubinstein, and Yong J Kil.
Full description: http://blog.kaggle.com/2011/01/15/how-we-did-it-the-winners-of-the-ijcnn-social-network-challenge/
Jonathan Lee
E-mail: jlee253@uwo.ca
Website: http://www.stats.uwo.ca/gradwebs/jlee/
This presentation and slides will be made available on my blog: http://www.compmath.com