Stepwise Backwards Regression
Nazym Satbekova
Hannah Worrall
Emily Wright
Conclusion
- For the given data set, the best model is the full model with no variables removed (although this result depends on the number of folds used)
- Not checking every possible combination of explanatory variables keeps the variable selection computationally feasible
- For use on other data sets, NA values should be taken into account (see the sketch below)
- For future versions, the main function (model.maker) could be made more robust by accepting different forms of the initial data and coercing them into the required matrix form
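A minimal sketch of the NA handling suggested above, assuming x is the matrix of explanatory variables and y the response; complete.cases is base R, and this pre-processing step is not part of the original code:
keep<-complete.cases(cbind(y,x)) # rows with no missing values
x<-x[keep,,drop=FALSE] # drop incomplete rows before calling model.maker
y<-y[keep]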
Cross-Validated Code
cross.validated.mse<-function(x,y,folds){
  reduced.data<-cbind(y,x)
  reordered.data<-reduced.data[sample(nrow(reduced.data)),] # scrambling data
  reordered.data<-as.matrix(reordered.data) # changed to matrix for matrix multiplication
  n<-length(y)
  n.variables<-length(reordered.data[1,-1])
  mse<-c()
  partitions<-list()
  first.index<-1
  for(i in 1:folds){ # making the partitions from the scrambled data
    second.index<-floor(i*n/folds)
    partitions[[i]]<-reordered.data[first.index:second.index,]
    first.index<-second.index+1
  }
  first.index<-1
  for (i in 1:folds){ # finding the MSE for each partition
    y.i<-as.matrix(partitions[[i]][,1])
    x.i<-as.matrix(partitions[[i]][,-1])
    line<-lm(y.i~x.i) # fit the model on the current fold
    coefficients<-coefficients(line)
    second.index<-floor(i*n/folds)
    test.data<-reordered.data[-(first.index:second.index),] # remove the current fold; the rest is the test data
    y.test.data<-as.matrix(test.data[,1])
    x.test.data<-as.matrix(test.data[,-1])
    first.index<-second.index+1
    predicted<-numeric(nrow(x.test.data)) # one prediction per test row
    for(j in 1:nrow(x.test.data)){
      predicted[j]<-coefficients[1]+coefficients[2:(n.variables+1)]%*%x.test.data[j,]
    }
    sse<-sum((y.test.data[,1]-predicted)^2)
    mse[i]<-sse/nrow(x.test.data) # mean squared prediction error over the test rows
  }
  return(mean(mse))
}
Cross-Validation Technique
- The data are shuffled and partitioned into k folds (default number of folds = 2)
- A linear model is fit on each fold, and its predictions are evaluated on the remaining folds to calculate an MSE
- The function returns the average of these MSEs (see the usage sketch below)
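As a usage sketch, cross.validated.mse can be run on simulated data (the values below are made up for illustration and are not the housing data):
set.seed(1) # simulated data for illustration only
x<-matrix(rnorm(600),ncol=3) # 200 observations, 3 explanatory variables
y<-2*x[,1]-x[,2]+rnorm(200)
cross.validated.mse(x,y,folds=5) # average MSE over the 5 folds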
Approach
- Stepwise backwards regression based on a greedy approach (a sketch of this procedure appears after the model.maker code below)
- Each explanatory variable is removed from the model one at a time
- The least powerful variable is the one whose removal produces the smallest change in MSE
- P + 1 models are produced, where P = the number of explanatory variables
- The best model is the one with the lowest cross-validated MSE
Model.maker Output
model.maker(x, y, folds = 2)
[[1]]
[1] 1.163158e+13 1.311605e+13 1.509916e+13 1.839752e+13 2.355850e+13
[6] 2.954402e+13 4.887981e+13 7.250439e+13 1.374093e+14
[[2]]
[1] "MedianIncome"   "MedianHouseAge" "TotalRooms"     "TotalBedrooms"
[5] "Population"     "Households"     "Latitude"       "Longitude"
[[3]]
[1] 1.163158e+13
[[4]]
Residuals:
    Min      1Q  Median      3Q     Max
-563013  -43592  -11327   30307  803996
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.594e+06  6.254e+04 -57.468  < 2e-16 ***
final.data1  4.025e+04  3.351e+02 120.123  < 2e-16 ***
final.data2  1.156e+03  4.317e+01  26.787  < 2e-16 ***
final.data3 -8.182e+00  7.881e-01 -10.381  < 2e-16 ***
final.data4  1.134e+02  6.902e+00  16.432  < 2e-16 ***
final.data5 -3.854e+01  1.079e+00 -35.716  < 2e-16 ***
final.data6  4.831e+01  7.515e+00   6.429 1.32e-10 ***
final.data7 -4.258e+04  6.733e+02 -63.240  < 2e-16 ***
final.data8 -4.282e+04  7.130e+02 -60.061  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- The final output of the model.maker function is a list with 4 components
- 1st list element: the cross-validated MSEs for each candidate model
- 2nd list element: the explanatory variables in the final model
- 3rd list element: the lowest MSE among the models
- 4th list element: the summary output for the regression based on the selected model
Putting It All Together
model.maker<-function(x,y,folds=2){
  x<-as.matrix(x)
  models<-generate.models(x,y) # builds the candidate models by greedy backward elimination
  best.model<-lowest.mse.model(models,y,folds) # selects the model with the lowest cross-validated MSE
  return(best.model)
}
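The helpers generate.models and lowest.mse.model are not shown in this section. The sketch below is a hypothetical reconstruction consistent with the approach described above: greedy backward elimination to build the candidate models, then cross.validated.mse to score them. Names and details are assumptions, not the authors' code:
# Hypothetical reconstruction of the helpers called by model.maker; the
# originals are not shown in this section, so all details are assumptions.
generate.models<-function(x,y){
  models<-list(x) # start from the full model
  current.x<-x
  while(ncol(current.x)>1){ # this sketch stops at one variable; the original yields P + 1 models
    training.mse<-sapply(1:ncol(current.x),function(j){
      fit<-lm(y~current.x[,-j,drop=FALSE]) # refit with variable j removed
      mean(residuals(fit)^2)
    })
    current.x<-current.x[,-which.min(training.mse),drop=FALSE] # drop the least powerful variable
    models[[length(models)+1]]<-current.x
  }
  return(models)
}
lowest.mse.model<-function(models,y,folds){
  cv.mse<-sapply(models,function(m) cross.validated.mse(m,y,folds)) # score every candidate model
  final.data<-models[[which.min(cv.mse)]] # variables of the winning model
  fit<-lm(y~final.data) # refit the selected model on the full data
  # the 4 components described on the Model.maker Output slide
  return(list(cv.mse,colnames(final.data),min(cv.mse),summary(fit)))
}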
The Data: Predicting Median House Value
- 8 explanatory variables
- Each explanatory variable is highly correlated with the response variable
- Running a simple linear fit on the data produces p-values less than or equal to 1.32 x 10^-10 for each variable (see the sketch below)
- All the variables are highly significant
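A minimal sketch of the kind of full-model fit behind these p-values, assuming y holds the median house values and x is the matrix of the 8 explanatory variables (these names are assumptions for illustration):
fit<-lm(y~x) # full model with all 8 explanatory variables
summary(fit) # per-variable t tests; the slide reports every p-value <= 1.32e-10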
Overview of Problem
Linear Regression with Many Explanatory Variables
- Using every variable is often suboptimal
- Deciding which variables to use can be a time-consuming process
- How do we decide which model is best without checking every possible combination of the variables? (see the count below)
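For scale, with P explanatory variables an exhaustive search fits 2^P candidate subsets, while the greedy approach described above fits only P + 1 models. For the 8 variables in this data set:
2^8 # exhaustive search: 256 candidate subsets
8+1 # greedy backward elimination: 9 models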