Exploring Boston Housing Data
Prepared by:
Sarah Cummings, Sriram Yarlagadda, Haifa Alsunaid
Introduction
Dataset
- Boston city data from 1978
- Obtained from the UCI Machine Learning Repository
Our Research Questions
- How do the variables provided in the dataset affect the median value of homes in Boston towns?
- Which variables affect the median value of homes the most?
- We also formed several hypotheses about how each independent variable relates to the dependent variable (median home value).
Pre-processing
Normality
- Based on the histogram, the residuals appear normally distributed
- Residuals show no signs of heteroscedasticity
- No major curvature in the residual plot
Multicollinearity
- We used VIF values to detect multicollinearity, with a threshold of 7. No predictor exceeded it
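The VIF screen can be sketched as follows. This is a minimal illustration on synthetic data, not the Boston dataset: the column names `a`, `b`, `c` and the helper `vif` are hypothetical, and only the threshold of 7 comes from the slides. Each predictor is regressed on the others and VIF = 1 / (1 - R^2).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n x p, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors (hypothetical): c is built to be nearly collinear with a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
flagged = vifs > 7  # threshold used in the slides
print(dict(zip(["a", "b", "c"], np.round(vifs, 1))), flagged)
```

In this toy setup the collinear pair is flagged while the independent column is not; on the real predictors, the slides report that nothing crossed the threshold.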
Outliers (MSE = 3.7)
Most points (>95%) fall within +/- 2*MSE
Very few points fall beyond +/- 3*MSE
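A residual band check of this kind might look like the sketch below. It runs on synthetic data, and it assumes the band unit is the residual standard error (the square root of MSE), which is a common convention; the slides state the bands in terms of MSE, so pass a different `unit` if that is what was intended. The helper `share_within` is hypothetical.

```python
import numpy as np

def share_within(residuals, k, unit):
    """Fraction of residuals whose absolute value is at most k * unit."""
    residuals = np.asarray(residuals, dtype=float)
    return float(np.mean(np.abs(residuals) <= k * unit))

# Synthetic regression (hypothetical data, not the Boston dataset).
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

mse = (resid @ resid) / (n - X.shape[1])
unit = np.sqrt(mse)  # ASSUMPTION: bands measured in residual standard errors

within_2 = share_within(resid, 2, unit)
within_3 = share_within(resid, 3, unit)
print(f"within 2 units: {within_2:.3f}, within 3 units: {within_3:.3f}")
```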
Transformation
Final Model
- Model selected using the AIC criterion and stepwise regression
- All terms in the final model are significant
- F-test is significant with p-value < 2.2e-16
- Adjusted R-squared of 0.87
Key Takeaways:
- Our model satisfies all the regression assumptions
- The model also answers our research question by identifying the variables that most significantly affect median home values.
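To make the AIC-based stepwise selection and the adjusted R-squared concrete, here is a rough sketch of backward elimination on synthetic data. It is not the authors' procedure or their data: the predictor names `x1`, `x2`, `x3` and all helper functions are hypothetical, and the AIC is the Gaussian form n*ln(SSE/n) + 2k, which is defined only up to an additive constant.

```python
import numpy as np

def design(X, cols):
    """Design matrix with an intercept plus the chosen columns of X."""
    return np.column_stack([np.ones(X.shape[0])] + [X[:, c] for c in cols])

def sse(D, y):
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    r = y - D @ beta
    return float(r @ r)

def aic(D, y):
    n, k = D.shape  # k counts the intercept
    return n * np.log(sse(D, y) / n) + 2 * k

def backward_aic(X, y, names):
    """Drop one predictor at a time while doing so lowers AIC."""
    keep = list(range(X.shape[1]))
    best = aic(design(X, keep), y)
    while keep:
        scores = [(aic(design(X, [c for c in keep if c != d]), y), d) for d in keep]
        candidate, drop = min(scores)
        if candidate >= best:
            break  # no single removal improves AIC
        best = candidate
        keep.remove(drop)
    return [names[c] for c in keep], best

def adjusted_r2(D, y):
    n, k = D.shape
    sst = float((y - y.mean()) @ (y - y.mean()))
    return 1.0 - (sse(D, y) / (n - k)) / (sst / (n - 1))

# Synthetic data (hypothetical): x3 has no true effect on y.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)
names = ["x1", "x2", "x3"]

selected, best_aic = backward_aic(X, y, names)
adj_r2 = adjusted_r2(design(X, [names.index(s) for s in selected]), y)
print(selected, round(adj_r2, 3))
```

A full stepwise search would also consider adding terms back in; the backward-only loop keeps the sketch short.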
Influential Points:
- |Studentized Deleted Residuals| > 3
- Hat Values > 0.5
- Cook’s Distance > 1
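The three influence measures above can be computed directly from an ordinary least squares fit. The sketch below uses synthetic data with one deliberately planted point; the helper `influence_measures` and the data are hypothetical, and only the cutoffs (3, 0.5, 1) come from the slides.

```python
import numpy as np

def influence_measures(X, y):
    """Hat values, studentized deleted residuals, and Cook's distances for OLS.

    X must already include the intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Hat values are the diagonal of X (X'X)^-1 X', obtained here via a thin QR.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    sse = float(e @ e)
    mse = sse / (n - p)
    # Studentized deleted residuals, computed without refitting n times.
    t = e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e**2))
    cooks = e**2 * h / (p * mse * (1.0 - h) ** 2)
    return h, t, cooks

# Synthetic data (hypothetical) with one planted high-leverage, off-trend point.
rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)
x[0], y[0] = 10.0, 6.0  # the trend would predict roughly 21 at x = 10

X = np.column_stack([np.ones(n), x])
h, t, cooks = influence_measures(X, y)

flagged = np.where((np.abs(t) > 3) | (h > 0.5) | (cooks > 1))[0]
print("flagged rows:", flagged)
```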