**REGRESSION**

**STATISTICS FOR BEGINNERS**

REGRESSION - definition and conditions

Least Squares Method for Regressions

So far ...

How Accurate is my model?

Assumptions

How to make sense of and report regression results

Outlier Diagnostics

**ISABEL FLORES**

1 continuous dependent variable

1 or more predictor variables

Independent variables may be continuous or categorical (dummies), or mixed.

You may have the same participants or different participants.

What is a Regression?

A regression is a mathematical function whose objective is to find relationships between variables.

Linear Regressions

Proposition of mathematical linear models:

F(x)= a + bx

When we do a regression we calculate the line that best fits our points, so we can estimate the value of y based on an observation of x.

The most common method is the "Least Squares Method"

The regression line is the one that minimizes the (squared) distances between itself and all the observations
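The least squares idea above can be sketched in a few lines of Python (illustrative only; the course itself uses SPSS, and the data points here are made up):

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x:
    b = cov(x, y) / var(x), a = mean(y) - b * mean(x)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Points that lie exactly on y = 1 + 2x, so the fit recovers a = 1, b = 2:
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With real, noisy data the fitted line will not pass through every point; it is simply the line with the smallest sum of squared residuals.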

Let's try ...

child.agression.sav

When we run a regression the interpretation is:

ANOVA: H0: there is no difference between the model and the mean of the dependent variable

H1: the model adds information beyond the mean

R and R squared have the same interpretation as in correlation. R squared is read as the percentage of variation that is explained by the independent variables. It indicates the quality of the fit of the points to the regression line.

Coefficients:

B (constant) - if the independent variable were zero, then the dependent variable would take the value of B (constant)

B (x) - when we add 1 point to x we can predict an increase of B in y.

What are the assumptions?

In regression the main assumptions are checked post-model, by observing the RESIDUAL DISTRIBUTION, except for the first assumption, which is that the initial plot looks linear.

Remember that by looking at the errors we are indirectly looking at all the variables considered in the model.

y = b0 + b1x + Error

Error = y - (b0 + b1x)

Error = y - ŷ
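The residual definition above can be sketched directly (illustrative Python with made-up data, not the course dataset):

```python
from statistics import mean

def ols_residuals(xs, ys):
    """Fit y = b0 + b1*x by least squares, then return the
    residuals Error_i = y_i - (b0 + b1*x_i)."""
    mx, my = mean(xs), mean(ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# When an intercept is included, OLS residuals sum to (numerically) zero:
e = ols_residuals([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```

This is why residual plots are centred on zero: whatever systematic pattern remains in them is information the model failed to capture.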

T-test

ANOVA

CORRELATION

Mean is our best predictor

Considers the possibility of better predictors than the mean ... the starting point of regression.

The conclusion was: following a certain manipulation (one predictor with two or more categories), one can (or cannot) assume that, on average, the results are sufficiently distant (or not) from our initial observations.

Our best estimate for a new value falls within a confidence interval

We want to observe how the differences appear at different levels of our input variable. Point by point: is there a better model than just the mean?

Regression aims at discovering the model that explains the relation between the dependent and independent variables.

It is all the same as in simple regression; we just introduce multiple predictors

Why?

Normally one predictor is responsible for only a small part of the variation of the dependent variable.

We want to achieve a model with a strong predicting power.

We also want to control for the effect of other variables, and test the response to a particular value.

Y = a + bx + error

Y = (model) + error

Y = (a + b1x1 + b2x2 + b3x3 + ... + bnxn) + error
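The multiple-predictor model above can be fitted by solving the normal equations; a small illustrative Python sketch (the toy data below are generated from a known model, y = 1 + 2·x1 + 3·x2, not taken from the course dataset):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for A x = b."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def multiple_ols(X, y):
    """Coefficients (a, b1, ..., bn) from the normal equations
    (Z'Z) beta = Z'y, where Z is X with a leading column of ones."""
    Z = [[1.0] + row for row in X]
    k = len(Z[0])
    ZtZ = [[sum(Z[r][i] * Z[r][j] for r in range(len(Z)))
            for j in range(k)] for i in range(k)]
    Zty = [sum(Z[r][i] * y[r] for r in range(len(Z))) for i in range(k)]
    return solve(ZtZ, Zty)

# Exact data from y = 1 + 2*x1 + 3*x2, so the fit recovers [1, 2, 3]:
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
coefs = multiple_ols(X, y)
```

SPSS does the same algebra internally; the point of the sketch is only that multiple regression is one linear system, not a sequence of simple regressions.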

Back to our example...

childagression.sav

Choosing a method of regression

There are several methods and they relate to the order in which each variable enters the model.

Enter

Blockwise

Stepwise

Remove

Forward

Backward

ENTER

The default in SPSS; it assumes that the researcher has a reason, based on theory or previous knowledge, for the order of influence.

Blockwise

Normally used when we have two blocks of variables that we want to test: a block that previous research has tested, and some new variables we want to assess.

Stepwise:

The decision is made on mathematical criteria, as the computer chooses which variable to try, normally based on individual correlations. There are three main methods. Normally used for background work and not for publication, as it is hardly ever reproducible.

Are there any influential cases ?

Can this model generalize to different samples?

1. Serious outliers

We recognize serious outliers when we standardize the residuals and: some residuals are larger than 3; more than 1% of the sample has residuals larger than 2.5; or more than 5% has residuals larger than 2.

A Cook's distance smaller than 1 means the outliers are not influencing the model
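The standardized-residual check above can be sketched as follows (a simplified z-score stand-in for SPSS's standardized residuals; the cutoff of 3 follows the rule above, and the residuals are made up):

```python
from statistics import mean, pstdev

def flag_outliers(residuals, cutoff=3.0):
    """Indices of cases whose standardized residual exceeds the cutoff.
    (Plain z-scores here, as a simplified stand-in for SPSS output.)"""
    m, s = mean(residuals), pstdev(residuals)
    return [i for i, e in enumerate(residuals) if abs((e - m) / s) > cutoff]

# Ten small residuals and one huge one; only the last case is flagged:
flagged = flag_outliers([0.1, -0.1, 0.1, -0.1, 0.1, -0.1,
                         0.1, -0.1, 0.1, -0.1, 10.0])
```

Note that an extreme case inflates the standard deviation itself, so in very small samples even a gross outlier may not exceed the cutoff.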

Samples: random, representative.

Sample size: big enough; at least 15 cases for each predictor (3 predictors means at least 45 cases), but ideally the rule should be 50 + 8k (k is the number of predictors).

Variable types: quantitative or categorical.

Variables: non-zero variance and no perfect multicollinearity (the correlation between predictors should not be too high).

Errors: no correlation with known external variables.

Homoscedasticity of residuals: the residuals should present a constant variance.

Errors: no autocorrelation; important for time-ordered variables, when a variable's value at one moment is highly correlated with its value at the previous moment.

Errors: normal distribution.

Linearity: the mean values of the outcome variable, for each level of the predictors, must lie along a straight line, so that the residuals average near zero.

Only when all the assumptions are met can we believe that we have an unbiased model, i.e. one that, on average, predicts for the population.

How to measure assumptions?

Multicollinearity: if there is strong collinearity between independent variables we are in the presence of a circular problem, and we do not arrive at a stable model.

1. Scan the correlation tables and spot correlations above 0.8.

2. VIF (variance inflation factor): this value should be smaller than 10, though above 5 you should take a second look and think carefully.

3. Tolerance = 1/VIF: under 0.1 indicates serious problems; under 0.2, take your time to double-check.
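With exactly two predictors, VIF and tolerance reduce to simple functions of the predictors' correlation; an illustrative Python sketch (the predictor values are made up):

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two variables."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """With only two predictors, the R^2 of regressing one on the other
    is just r^2, so VIF = 1 / (1 - r^2) and Tolerance = 1 / VIF."""
    r2 = pearson_r(x1, x2) ** 2
    vif = 1.0 / (1.0 - r2)
    return vif, 1.0 / vif

# Predictors correlated at r = 0.8: VIF well under 10, tolerance 0.36.
vif, tol = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

With more than two predictors, each VIF comes from regressing that predictor on all the others, which is what SPSS reports.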

Autocorrelation: measured through the Durbin-Watson statistic, with a value around 2 meaning that the residuals are uncorrelated.
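The Durbin-Watson statistic is easy to compute directly from the residuals; an illustrative Python sketch (toy residual series, not from the dataset):

```python
def durbin_watson(e):
    """DW = sum (e_t - e_{t-1})^2 / sum e_t^2.
    Around 2: residuals uncorrelated; toward 0: positive
    autocorrelation; toward 4: negative autocorrelation."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(x * x for x in e)

# Alternating signs: DW = 3, above 2 (negative autocorrelation):
dw_neg = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Slowly changing residuals: DW = 1, below 2 (positive autocorrelation):
dw_pos = durbin_watson([1.0, 1.0, -1.0, -1.0])
```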

If assumptions do not hold, there are some corrections you may try ... but mostly you will not be able to generalize the model.

There are no non-parametric options ...

Report the coefficient table:

And the model R squared = 0.83

Remember we were calculating the influence of each variable on our outcome variable.

Child Aggression = -0.005 + 0.057 PS + 0.082 SA + 0.142 CG - 0.04 D

Note that TV was excluded because it had sig = 0.475 (non-significant).

A positive value means a positive relation to the dependent variable, while a negative means a negative relationship.

Each coefficient tells us how the outcome would change if all the other predictors were held constant.

For example: if PS (Parenting Style) increases one point, child aggression will (on average) increase by 0.057 points.

Beta allows a comparison in standard deviations, as it is measured in standardized values: for a one-standard-deviation increase in PS, child aggression would increase by 0.177 standard deviations. We can read this in percentages, which makes it more intuitive: if PS changes by 100% of a standard deviation, child aggression changes by 17.7%.
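The relation between B and Beta described above can be sketched directly (illustrative Python; Beta = B × sd(x)/sd(y), with made-up data):

```python
from statistics import pstdev

def standardized_beta(b, xs, ys):
    """Beta = B * sd(x) / sd(y): the change in y, in standard
    deviations, for a one-standard-deviation change in x."""
    return b * pstdev(xs) / pstdev(ys)

# If y = 2x exactly, the standardized coefficient is 1 (perfect relation):
beta = standardized_beta(2.0, [1, 2, 3, 4], [2, 4, 6, 8])
```

Because Betas are unit-free, they can be compared across predictors measured on different scales, which raw B coefficients cannot.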

Autocorrelation, which is checked via the Durbin-Watson statistic (close to 2), concerns the correlation of the residuals in a linear regression. The residuals are everything that goes into neither the constant nor the Betas; they are completely outside the model's explanatory capacity.

There are several reasons for high autocorrelation, the main one being: the model is no good. It has a low R squared, and consequently a very large error. The errors of the different individuals will inevitably be related to one another, for the simple reason that all the errors are large.

When the model fits the data well and we still have an autocorrelation problem, it has to do with the existence of response patterns, which may suggest a lack of independence between cases. Graphically, the P-P plots show a pattern. Responses coincide across individuals in the sample, suggesting that they are not independent of one another. This can be caused by an actual lack of independence, or simply by the question leading to very similar answers across individuals, i.e. it does not differentiate the cases; in the limit it is a non-variable, because it has no variability (everyone answers the same).