Stepwise Backwards Regression

Conclusion

  • For the given data set, the best model is the one with no variables removed (although this depends on how many folds are used)
  • Not checking every possible combination of explanatory variables keeps variable selection computationally feasible
  • For use on other data sets, NA values should be taken into account
  • In future versions, the main function (model.maker) could be made more robust by accepting different forms of initial data and converting them into the desired matrix form
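One hedged way to address the NA point above would be to drop incomplete rows before fitting; `complete.cases` is standard base R, and the data here are purely illustrative:

```r
# Drop rows containing NA before model fitting (illustrative data)
x <- matrix(c(1, NA, 3, 4, 5, 6), ncol = 2)
y <- c(10, 20, 30)
keep <- complete.cases(x, y)          # TRUE only for rows with no missing values
x.clean <- x[keep, , drop = FALSE]
y.clean <- y[keep]
nrow(x.clean)                         # 2 complete rows remain
```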

Nazym Satbekova

Hannah Worrall

Emily Wright

Cross-Validated Code

cross.validated.mse <- function(x, y, folds) {
  reduced.data <- cbind(y, x)
  reordered.data <- reduced.data[sample(nrow(reduced.data)), ]  # scrambling data
  reordered.data <- as.matrix(reordered.data)  # matrix form for matrix multiplication
  n <- length(y)
  n.variables <- ncol(reordered.data) - 1
  p <- n.variables + 1
  mse <- c()
  partitions <- list()
  first.index <- 1
  for (i in 1:folds) {  # making the partitions from the scrambled data
    second.index <- floor(i * n / folds)
    partitions[[i]] <- reordered.data[first.index:second.index, ]
    first.index <- second.index + 1
  }
  first.index <- 1
  for (i in 1:folds) {  # finding the MSE for each partition
    y.i <- as.matrix(partitions[[i]][, 1])
    x.i <- as.matrix(partitions[[i]][, -1])
    line <- lm(y.i ~ x.i)
    coefficients <- coefficients(line)
    second.index <- floor(i * n / folds)
    test.data <- reordered.data[-(first.index:second.index), ]  # removing fold i; the rest is the test set
    y.test.data <- as.matrix(test.data[, 1])   # predict the held-out rows, not the full data
    x.test.data <- as.matrix(test.data[, -1])
    first.index <- second.index + 1
    predicted <- numeric(nrow(x.test.data))
    for (j in 1:nrow(x.test.data)) {
      predicted[j] <- coefficients[1] + coefficients[2:p] %*% x.test.data[j, ]
    }
    sse <- sum((y.test.data[, 1] - predicted)^2)
    mse[i] <- sse / nrow(x.test.data)  # mean squared error over the test rows
  }
  return(mean(mse))
}

Cross-Validation Technique

  • Data is shuffled and partitioned into k folds (default number of folds = 2)
  • A linear model is fitted on each fold and used to predict the other folds, and the resulting MSE is calculated
  • The output of the function is the average of the MSEs
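The fold boundaries in cross.validated.mse come from floor(i * n / folds); a quick check (toy numbers, not taken from the data set) shows that consecutive boundaries assign every row to exactly one fold even when n is not divisible by the number of folds:

```r
# Fold i covers rows (floor((i - 1) * n / folds) + 1) .. floor(i * n / folds)
n <- 10
folds <- 3
first.index <- 1
sizes <- integer(folds)
for (i in 1:folds) {
  second.index <- floor(i * n / folds)
  sizes[i] <- second.index - first.index + 1  # rows in fold i
  first.index <- second.index + 1
}
sizes        # fold sizes 3, 3, 4
sum(sizes)   # all 10 rows assigned exactly once
```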

model.maker(x, y, folds = 2)

Approach

[[4]]

Residuals:
     Min       1Q   Median       3Q      Max
 -563013   -43592   -11327    30307   803996

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.594e+06  6.254e+04 -57.468  < 2e-16 ***
final.data1  4.025e+04  3.351e+02 120.123  < 2e-16 ***
final.data2  1.156e+03  4.317e+01  26.787  < 2e-16 ***
final.data3 -8.182e+00  7.881e-01 -10.381  < 2e-16 ***
final.data4  1.134e+02  6.902e+00  16.432  < 2e-16 ***
final.data5 -3.854e+01  1.079e+00 -35.716  < 2e-16 ***
final.data6  4.831e+01  7.515e+00   6.429 1.32e-10 ***
final.data7 -4.258e+04  6.733e+02 -63.240  < 2e-16 ***
final.data8 -4.282e+04  7.130e+02 -60.061  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  • Stepwise backwards regression based on a greedy approach
  • Each explanatory variable is removed from the model one at a time
  • The least powerful variable is the one whose removal produces the smallest change in MSE
  • P + 1 models are produced, where P = number of explanatory variables
  • The best model is determined by the lowest cross-validated MSE
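The greedy removal step described above can be sketched as follows. This is a simplified stand-in for the presenters' generate.models helper, run on simulated data in which v3 is deliberately pure noise:

```r
set.seed(1)
x <- matrix(rnorm(300), ncol = 3)
colnames(x) <- c("v1", "v2", "v3")
y <- 2 * x[, 1] + x[, 2] + rnorm(100, sd = 0.1)  # v3 does not enter the model

# In-sample MSE of a linear fit on a given set of columns
in.sample.mse <- function(x, y) mean(residuals(lm(y ~ x))^2)

# Refit with each variable removed; the weakest variable is the one
# whose removal leaves the MSE closest to the full-model MSE
drop.mse <- sapply(1:ncol(x), function(j) in.sample.mse(x[, -j, drop = FALSE], y))
weakest <- which.min(drop.mse)
colnames(x)[weakest]   # "v3"
```

A full backward pass would repeat this step on the reduced matrix until no variables remain, recording each intermediate model.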

model.maker(x, y, folds = 2)

[[1]]

[1] 1.163158e+13 1.311605e+13 1.509916e+13 1.839752e+13 2.355850e+13

[6] 2.954402e+13 4.887981e+13 7.250439e+13 1.374093e+14

[[2]]

[1] "MedianIncome" "MedianHouseAge" "TotalRooms" "TotalBedrooms"

[5] "Population" "Households" "Latitude" "Longitude"

[[3]]

[1] 1.163158e+13

Model.maker Output

  • The final output of the model.maker function is a list with 4 components
  • 1st list element: the cross-validated MSEs for each candidate model
  • 2nd list element: the explanatory variables in the final model
  • 3rd list element: the lowest MSE among the models
  • 4th list element: the summary output for the regression based on the selected model

model.maker <- function(x, y, folds = 2) {
  x <- as.matrix(x)
  models <- generate.models(x, y)                    # candidate models from backward elimination
  best.model <- lowest.mse.model(models, y, folds)   # pick by cross-validated MSE
  return(best.model)
}

Putting It All Together

The Data: Predicting Median House Value

  • 8 explanatory variables
  • Each explanatory variable is highly correlated with the response variable
  • Running a simple linear fit on the data produces p-values less than or equal to 1.32 x 10^-10 for each variable
  • All the variables are highly significant
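A hedged sketch of that significance check, on simulated data rather than the actual housing columns: fit the full model and inspect the largest slope p-value in the summary table.

```r
set.seed(2)
x <- matrix(rnorm(800), ncol = 8)    # 8 explanatory variables, as in the data set
y <- x %*% (1:8) + rnorm(100)        # here every column truly matters
fit <- summary(lm(y ~ x))
max.p <- max(fit$coefficients[-1, "Pr(>|t|)"])  # largest slope p-value
max.p < 0.001                                   # all slopes highly significant
```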

Overview of Problem

Linear Regression with many Explanatory Variables

  • Using every variable is often suboptimal
  • Deciding which variables to use can be a time-consuming process
  • How do we decide which model is best without checking every possible combination of the variables?
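For the 8-variable data set used here, the saving is easy to quantify: an exhaustive search fits every subset of variables, while the greedy backwards path fits only P + 1 models.

```r
P <- 8                  # explanatory variables in the housing data
exhaustive <- 2^P       # every subset of the variables: 256 models
greedy <- P + 1         # stepwise backwards path: 9 models
c(exhaustive, greedy)
```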