Stepwise Backwards Regression
Nazym Satbekova
Hannah Worrall
Emily Wright
Conclusion
- For the given data set, the best model is the full model with no variables removed (although this result depends on the number of folds used)
- Not checking every possible combination of explanatory variables keeps the variable selection computationally feasible
- For use on other data sets, NA values should be taken into account (see the sketch below)
- For future versions, the main function (model.maker) could be made more robust by accepting different forms of the initial data and coercing them into the required matrix form
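A minimal sketch of the NA handling suggested above, assuming x is the matrix of explanatory variables and y the response; complete.cases is base R, and this pre-processing step is not part of the original code:
keep<-complete.cases(cbind(y,x)) # rows with no missing values
x<-x[keep,,drop=FALSE] # drop incomplete rows before calling model.maker
y<-y[keep]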
Cross-Validated Code
cross.validated.mse<-function(x,y,folds){
  reduced.data<-cbind(y,x)
  reordered.data<-reduced.data[sample(nrow(reduced.data)),] # scrambling data
  reordered.data<-as.matrix(reordered.data) # changed to matrix for matrix multiplication
  n<-length(y)
  n.variables<-length(reordered.data[1,-1])
  mse<-c()
  partitions<-list()
  first.index<-1
  for(i in 1:folds){ # making the partitions from the scrambled data
    second.index<-floor(i*n/folds)
    partitions[[i]]<-reordered.data[first.index:second.index,]
    first.index<-second.index+1
  }
  first.index<-1
  for (i in 1:folds){ # finding the MSE for each partition
    y.i<-as.matrix(partitions[[i]][,1])
    x.i<-as.matrix(partitions[[i]][,-1])
    line<-lm(y.i~x.i) # fit the model on the current fold
    coefficients<-coefficients(line)
    second.index<-floor(i*n/folds)
    test.data<-reordered.data[-(first.index:second.index),] # remove the current fold; the rest is the test data
    y.test.data<-as.matrix(test.data[,1])
    x.test.data<-as.matrix(test.data[,-1])
    first.index<-second.index+1
    predicted<-numeric(nrow(x.test.data)) # one prediction per test row
    for(j in 1:nrow(x.test.data)){
      predicted[j]<-coefficients[1]+coefficients[2:(n.variables+1)]%*%x.test.data[j,]
    }
    sse<-sum((y.test.data[,1]-predicted)^2)
    mse[i]<-sse/nrow(x.test.data) # mean squared prediction error over the test rows
  }
  return(mean(mse))
}
Cross-Validation Technique
- The data are shuffled and partitioned into k folds (default number of folds = 2)
- A linear model is fit on each fold, and its predictions are evaluated on the remaining folds to calculate an MSE
- The function returns the average of these MSEs (see the usage sketch below)
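As a usage sketch, cross.validated.mse can be run on simulated data (the values below are made up for illustration and are not the housing data):
set.seed(1) # simulated data for illustration only
x<-matrix(rnorm(600),ncol=3) # 200 observations, 3 explanatory variables
y<-2*x[,1]-x[,2]+rnorm(200)
cross.validated.mse(x,y,folds=5) # average MSE over the 5 folds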
Approach
- Stepwise backwards regression based on a greedy approach (a sketch of this procedure appears after the model.maker code below)
- Each explanatory variable is removed from the model one at a time
- The least powerful variable is the one whose removal produces the smallest change in MSE
- P + 1 models are produced, where P = the number of explanatory variables
- The best model is the one with the lowest cross-validated MSE
Model.maker Output
model.maker(x, y, folds = 2)
[[1]]
[1] 1.163158e+13 1.311605e+13 1.509916e+13 1.839752e+13 2.355850e+13
[6] 2.954402e+13 4.887981e+13 7.250439e+13 1.374093e+14
[[2]]
[1] "MedianIncome"   "MedianHouseAge" "TotalRooms"     "TotalBedrooms"
[5] "Population"     "Households"     "Latitude"       "Longitude"
[[3]]
[1] 1.163158e+13
[[4]]
Residuals:
    Min      1Q  Median      3Q     Max
-563013  -43592  -11327   30307  803996
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.594e+06  6.254e+04 -57.468  < 2e-16 ***
final.data1  4.025e+04  3.351e+02 120.123  < 2e-16 ***
final.data2  1.156e+03  4.317e+01  26.787  < 2e-16 ***
final.data3 -8.182e+00  7.881e-01 -10.381  < 2e-16 ***
final.data4  1.134e+02  6.902e+00  16.432  < 2e-16 ***
final.data5 -3.854e+01  1.079e+00 -35.716  < 2e-16 ***
final.data6  4.831e+01  7.515e+00   6.429 1.32e-10 ***
final.data7 -4.258e+04  6.733e+02 -63.240  < 2e-16 ***
final.data8 -4.282e+04  7.130e+02 -60.061  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- The final output of the model.maker function is a list with 4 components
- 1st list element: the cross-validated MSEs for each candidate model
- 2nd list element: the explanatory variables in the final model
- 3rd list element: the lowest MSE among the models
- 4th list element: the summary output for the regression based on the selected model
Putting It All Together
model.maker<-function(x,y,folds=2){
  x<-as.matrix(x)
  models<-generate.models(x,y) # builds the candidate models by greedy backward elimination
  best.model<-lowest.mse.model(models,y,folds) # selects the model with the lowest cross-validated MSE
  return(best.model)
}
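The helpers generate.models and lowest.mse.model are not shown in this section. The sketch below is a hypothetical reconstruction consistent with the approach described above: greedy backward elimination to build the candidate models, then cross.validated.mse to score them. Names and details are assumptions, not the authors' code:
# Hypothetical reconstruction of the helpers called by model.maker; the
# originals are not shown in this section, so all details are assumptions.
generate.models<-function(x,y){
  models<-list(x) # start from the full model
  current.x<-x
  while(ncol(current.x)>1){ # this sketch stops at one variable; the original yields P + 1 models
    training.mse<-sapply(1:ncol(current.x),function(j){
      fit<-lm(y~current.x[,-j,drop=FALSE]) # refit with variable j removed
      mean(residuals(fit)^2)
    })
    current.x<-current.x[,-which.min(training.mse),drop=FALSE] # drop the least powerful variable
    models[[length(models)+1]]<-current.x
  }
  return(models)
}
lowest.mse.model<-function(models,y,folds){
  cv.mse<-sapply(models,function(m) cross.validated.mse(m,y,folds)) # score every candidate model
  final.data<-models[[which.min(cv.mse)]] # variables of the winning model
  fit<-lm(y~final.data) # refit the selected model on the full data
  # the 4 components described on the Model.maker Output slide
  return(list(cv.mse,colnames(final.data),min(cv.mse),summary(fit)))
}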
The Data: Predicting Median House Value
- 8 explanatory variables
- Each explanatory variable is highly correlated with the response variable
- Running a simple linear fit on the data produces p-values less than or equal to 1.32 x 10^-10 for each variable (see the sketch below)
- All the variables are highly significant
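A minimal sketch of the kind of full-model fit behind these p-values, assuming y holds the median house values and x is the matrix of the 8 explanatory variables (these names are assumptions for illustration):
fit<-lm(y~x) # full model with all 8 explanatory variables
summary(fit) # per-variable t tests; the slide reports every p-value <= 1.32e-10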
Overview of Problem
Linear Regression with Many Explanatory Variables
- Using every variable is often suboptimal
- Deciding which variables to use can be a time-consuming process
- How do we decide which model is best without checking every possible combination of the variables? (see the count below)
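For scale, with P explanatory variables an exhaustive search fits 2^P candidate subsets, while the greedy approach described above fits only P + 1 models. For the 8 variables in this data set:
2^8 # exhaustive search: 256 candidate subsets
8+1 # greedy backward elimination: 9 models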