Transcript of REGRESSION
STATISTICS FOR BEGINNERS
REGRESSION - definition and conditions
Least Squares Method for Regression
So far ...
How Accurate is my model?
How to make sense and report regression results
1 continuous dependent variable
1 or more predictor variables
Independent variables may be continuous or categorical (dummies), or mixed.
You may have the same participants or different participants.
What is a Regression?
A regression is a mathematical function whose objective is to find relationships between variables.
Proposition of mathematical linear models:
F(x)= a + bx
When we run a regression we calculate the line that best fits our points, so we can estimate the value of y based on an observation of x.
The most common method is the "Least Squares Method".
The regression line is the one that minimizes the (squared vertical) distances between itself and all the observations.
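The least-squares idea can be sketched numerically. The data below are made up for illustration; the formulas are the standard least-squares estimates for slope and intercept:

```python
import numpy as np

# Hypothetical data: x is the predictor, y the outcome.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates: b = cov(x, y) / var(x), a = mean(y) - b * mean(x).
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(a, b)  # intercept and slope of the best-fitting line
```

Any other line through these points would have a larger sum of squared vertical distances to the observations.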
Let's try ...
When we run a regression the interpretation is:
ANOVA: H0: there is no difference between the model and the mean of the dependent variable
H1: the model adds information beyond the mean
R and R squared have the same interpretation as in correlation. R² is read as the percentage of variation explained by the independent variables. It indicates the quality of the fit of the points to the regression line.
B (constant) - if the independent variable were zero, then the dependent variable would assume the value of B (constant)
B (x) - when we add 1 point to x we can predict an increase of B in y.
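The R² reading above can be sketched with made-up data: fit the line, then compare the model's residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

# Hypothetical data: fit y = a + b*x, then report R².
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

# R² = 1 - SS_residual / SS_total: the share of the variation in y
# that the model explains beyond just using the mean of y.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

An R² near 1 means the points lie close to the regression line; an R² near 0 means the line does little better than the mean.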
What are the assumptions?
In regression the main assumption checks are post-model and are based on the observation of the RESIDUAL DISTRIBUTION, except for the first assumption, which is having a linear initial plot.
Remember that by looking at the errors we are indirectly looking at all the variables considered in the model.
y = b0 + b1x + Error
Error = y - (b0 + b1x)
Error = y - ŷ
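The error equations above can be computed directly; the coefficients and data here are illustrative, not estimates from real data.

```python
import numpy as np

# Residuals are what the model leaves unexplained: Error = y - (b0 + b1*x).
# b0 and b1 are hypothetical coefficient values for this sketch.
b0, b1 = 0.14, 1.96
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

errors = y - (b0 + b1 * x)
print(errors)         # one residual per observation
print(errors.mean())  # for a least-squares fit this is (near) zero
```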
Mean is our best predictor
Considering the possibility of better predictors than the mean ... the starting point of regression.
The conclusion was: following a certain manipulation (by one predictor with two or more categories) one can (or can't) assume that, on average, the results are sufficiently distant (or not) from our initial observations.
Our best estimate for a new value falls within a confidence interval.
We want to observe how the differences behave for different levels of our input variable. Point by point: is there a better model than just the mean?
Regression aims at discovering the model that explains the relation between the dependent and independent variables.
It is all the same as in simple regression; we just introduce multiple predictors.
Normally one predictor is responsible only for a small part of the variation of the dependent variable.
We want to achieve a model with a strong predicting power.
We also want to control for the effect of other variables, and test the response to a particular value.
Y = a + bx + error
Y = (model) + error
Y = (a + b1x1 + b2x2 + b3x3 + ... + bnxn) + error
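The multiple-predictor model above can be fitted by least squares as well. This sketch uses simulated data (the true coefficients are chosen by us, so we can see that the fit recovers them):

```python
import numpy as np

# Multiple regression sketch: y = a + b1*x1 + b2*x2 + error,
# solved by least squares with np.linalg.lstsq. Data are simulated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=50)

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)  # close to the true values [1.0, 2.0, -0.5]
```

Each fitted coefficient is the predicted change in y for a one-unit change in that predictor, holding the others constant.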
Back to our example...
Choosing a method of regression
There are several methods, and they relate to the order in which each variable enters the model.
Pre-defined in SPSS; assumes that the researcher has a reason, based on theory or previous knowledge, for the order of influence.
Normally used when we have two blocks of variables that we want to test - a block that previous research has tested, and some new variables we want to assess.
The decision is made based on mathematical criteria, as the computer chooses which variable to try, normally based on individual correlations. There are three main methods. Normally used for background work and not for publication, as it is hardly ever reproducible.
Are there any influential cases?
Can this model generalize to different samples?
1. Serious outliers
We recognise serious outliers when we standardize the residuals: there are problems if any standardized residual is higher than 3; if more than 1% of the sample has residuals bigger than 2.5; or if more than 5% has residuals bigger than 2.
A Cook's distance smaller than 1 means the outliers are not influencing the model.
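The standardized-residual checks above can be sketched as follows; the residuals here are simulated, with one extreme case planted deliberately:

```python
import numpy as np

# Standardized residuals: residuals divided by their standard deviation.
rng = np.random.default_rng(1)
residuals = rng.normal(size=200)
residuals[0] = 6.0  # an artificial extreme case

z = (residuals - residuals.mean()) / residuals.std(ddof=1)
print(np.sum(np.abs(z) > 3))     # count of serious outliers
print(np.mean(np.abs(z) > 2.5))  # proportion above 2.5 (flag if over 1%)
print(np.mean(np.abs(z) > 2))    # proportion above 2 (flag if over 5%)
```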
Sample: random, representative.
Sample size: big enough - at least 15 cases for each predictor (3 predictors means at least 45 cases). But ideally the rule should be 50 + 8k (k is the number of predictors).
Predictors: quantitative or categorical.
Predictors: non-zero variance and without perfect multicollinearity (the correlation between predictors should not be too high).
Predictors: no correlation to known external variables.
Homoscedasticity of residuals: the residuals should present a constant variance.
Independence: no autocorrelation - important for timed variables, when a variable at a certain moment is highly correlated to its value at the previous moment.
Residuals: normal distribution.
Residuals: mean values must be near zero.
Only when all the assumptions are met can we believe that we have an unbiased model, i.e. on average we can predict for the population.
How to measure assumptions?
Multicollinearity: if there is strong collinearity between independent variables we are in the presence of a circular problem, and we do not arrive at a stable model.
1. Scan the correlation tables and spot correlations above 0.8.
2. VIF (variance inflation factor): this value should be smaller than 10, though above 5 you should take a second look and think carefully.
3. Tolerance = 1/VIF: under 0.1 indicates serious problems; under 0.2 take your time to double check.
Autocorrelation: measured through the Durbin-Watson statistic, with a value around 2 meaning that the residuals are uncorrelated.
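The VIF and tolerance checks above can be sketched by hand: regress one predictor on the others and use VIF = 1 / (1 - R²). The data are simulated, with x2 deliberately correlated with x1.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=100)  # collinear with x1
x3 = rng.normal(size=100)

def vif(target, others):
    """VIF of one predictor: regress it on the other predictors."""
    X = np.column_stack([np.ones(len(target))] + others)
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coefs
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

v = vif(x1, [x2, x3])
print(v, 1 / v)  # VIF and tolerance for x1
```

With x2 built from x1, the VIF for x1 is well above 1 but below the "serious problem" threshold of 10.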
If the assumptions do not hold, there are some corrections you may try... but mostly you will not be able to generalize the model.
There are no non-parametric options...
Report the coefficient table:
And the model R² = 0.83
Remember we were calculating the influence of each variable on our outcome variable.
Child Aggression = -0.005 + 0.057 PS + 0.082 SA + 0.142 CG - 0.04 D
Note that TV was excluded because it had sig = 0.475 (non-significant).
A positive value means a positive relation to the dependent variable, while a negative means a negative relationship.
Each coefficient tells us how the outcome would change if all the other predictors were held constant.
For example: if PS (Parent Style) increases one point, child aggression (on average) will increase by 0.057 points.
Beta allows for a comparison in standard deviations, as it is measured in standardized values: for a one standard deviation increase in PS, child aggression would increase by 0.177 standard deviations. We can read this in percentage terms, which makes it more intuitive: if PS changes 100%, child aggression changes 17.7%.
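The reported equation can be used directly for prediction. The coefficients come from the slide's example; the predictor values passed in below are hypothetical.

```python
# Child Aggression = -0.005 + 0.057*PS + 0.082*SA + 0.142*CG - 0.04*D
# (coefficients from the example; predictor values are made up).
def predict_aggression(ps, sa, cg, d):
    return -0.005 + 0.057 * ps + 0.082 * sa + 0.142 * cg - 0.04 * d

print(predict_aggression(1, 1, 1, 1))  # each predictor set to 1 unit
```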
Autocorrelation, which is checked with the Durbin-Watson statistic (close to 2), concerns the correlation of the residuals in a linear regression. The residuals are everything that enters neither the constant nor the Betas; they are completely outside the explanatory capacity of the model.
There are several reasons for high autocorrelation, the main one being: the model is no good. It has a low R², and consequently a very large error. The errors of the various individuals will inevitably be related to one another, for the simple fact that all the errors are large.
When we have a good "fit" of the model to the data, and still have an autocorrelation problem, it has to do with the existence of response patterns, which may suggest a lack of independence between cases. Graphically, the P-P plots show a pattern. There is coincidence of responses between the individuals in the sample, suggesting that they are not independent of one another. It can be caused by an effective lack of independence, or by the simple fact that the question leads to very similar answers across individuals, i.e. it does not differentiate the cases; in the limit it is a non-variable because it has no variability (everyone answers the same).
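The Durbin-Watson check described above can be sketched directly from its definition, DW = Σ(e_t - e_(t-1))² / Σe_t². The residual series here are simulated: one independent, one strongly autocorrelated.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic: near 2 means uncorrelated residuals,
    near 0 means strong positive autocorrelation."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)
independent = rng.normal(size=500)
trended = np.cumsum(rng.normal(size=500))  # random walk: autocorrelated

dw_independent = durbin_watson(independent)
dw_trended = durbin_watson(trended)
print(dw_independent)  # near 2 for uncorrelated residuals
print(dw_trended)      # far below 2 for autocorrelated residuals
```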