**Chi Square and Logistic Regression**

**Statistics for Beginners - 9th session**

T-tests

(Mann-Whitney, Wilcoxon) - compare means for two categories

ANOVA

(Kruskal-Wallis, Friedman) - compare means for more than two categories

Pearson correlations

(Spearman) - compare one continuous predictor

Linear Regression

- compare two or more predictors.


What questions can we ask when the dependent variable is categorical?

How to explain:

dead/alive; diabetic/non-diabetic; repeated/did not repeat; married/unmarried ...

Assumptions and what can go wrong

Linearity

- logistic regression assumes a linear relation between each continuous predictor and the logit of the outcome variable (y*). To test this (often called the Box-Tidwell test), create the natural log of each continuous predictor, add the predictor × ln(predictor) interaction terms to the model, and rerun the logistic regression. Any significant interaction term means that linearity is not achieved.

Independence of Errors

- the cases (observations) must be independent of one another

Multicollinearity

- predictors should not be too highly correlated; we check this with the tolerance statistic (tolerance = 1/VIF, where values below about 0,1 are commonly taken as a warning sign)

Chi Square Test

Compares two categorical variables:

Is there a relation between learning to dance and the type of reward?

Logistic Regressions

Dependent Variable - two categories

Independent variable - any sort of variable

**ISABEL FLORES**

Continuous Outcome

Categorical Outcome

Chi-Square

- one categorical predictor

Logistic Regression

- two or more predictors

Sometimes we want to look at relationships between a categorical variable (dependent) and other variables (independent)

So we are starting with the simple case

Compare two categorical variables.

We cannot use the mean of two categorical variables...

on average we cannot be 0,56 alive (or married, or diabetic ...)

Remember that the values attributed to categorical variables are arbitrary. Therefore we should always analyze FREQUENCIES and not means.

our example is: cats.sav

The question is: what is the best reward to teach cats how to dance?

Dependent Variable

: Cat learned to dance / cat did not learn to dance

Independent Variable:

Reward (Food ; Affection)

It tests the null hypothesis - that nothing happens:

H0: The type of reward makes no difference when we teach cats to dance;

H1: The type of reward makes a difference.

The Chi-Square statistic compares expected and observed frequencies: χ² = Σ (observed - expected)² / expected, summed over all cells of the table.

Computing a Chi-square for our cats

The output shows several statistics besides the Pearson Chi-Square.
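The same test can be sketched outside SPSS with scipy, using hypothetical counts for the cats example (the real cats.sav frequencies may differ):

```python
from scipy.stats import chi2_contingency

#                 danced  did not dance
observed = [[28, 10],     # food reward
            [48, 114]]    # affection reward

# correction=False gives the plain Pearson Chi-Square;
# the expected frequencies are returned alongside the statistic
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof, round(p, 6))
```

A p-value below 0,05 would lead us to reject H0 and conclude the reward type matters.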

Fisher's Exact Test

- use when a sample is small

Likelihood Ratio

- an alternative to the Pearson statistic that compares the observed frequencies with those predicted by the model - we'll see more on this in logistic regression

Continuity Correction

- a correction to the Pearson test used when both categorical variables have only two categories (a 2×2 table).

The Assumptions:

1. Independence of data

2. Expected frequencies of at least 5 in each cell (meaning that if you are testing two categories, 10 observations is the bare minimum)

Calculating the Effect size

The Symmetric Measures table (Phi, Cramér's V) gives you a grip on the effect, as it is reported on a 0-1 scale, with 0 meaning no effect and 1 a perfect association.

up to 0,2 - small effect

0,2-0,5 - medium effect

above 0,5 - large effect
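The effect size can be computed directly from the Chi-Square statistic; a sketch with the same hypothetical cat counts as above:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[28, 10], [48, 114]])  # hypothetical counts
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

n = observed.sum()
k = min(observed.shape) - 1          # 1 for a 2x2 table, so V reduces to phi
cramers_v = np.sqrt(chi2 / (n * k))  # Cramér's V, on the 0-1 scale
print(round(cramers_v, 3))
```

For these counts the value lands in the 0,2-0,5 band, i.e. a medium effect by the thresholds above.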

This type of regression allows us to predict categorical outcomes based on predictor variables.

That is, which category an individual is likely to belong to, given certain information.

we can predict if the next person to come in is a male or a female based on:

laziness; stubbornness; multi-tasking ability; aggressiveness

We can predict the probability of a person having cancer, high blood pressure or diabetes, or even surviving.

Let's start with an example:

diabetes_data.sav

The problem is to assess whether diabetes is linked to excess weight.

In graphic terms, logistic regression fits an S-shaped (sigmoid) curve, rather than a straight line, to the probability of occurrence.

The coefficients are read as the rise or reduction in the log-odds of the event. If you actually want to calculate the probability you have to use the following expression:

p = e^(y*) / (1 + e^(y*))

where y* is the forecast value of the logistic regression (the log-odds) and p the probability of the original event.

Please note that logistic regression will perform poorly when the two outcome groups do not overlap in the data (this complete separation makes the estimates unstable).
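A minimal sketch of converting the forecast value y* into a probability:

```python
import math

def logit_to_prob(y_star):
    """p = e^y* / (1 + e^y*), the inverse of the logit."""
    return math.exp(y_star) / (1.0 + math.exp(y_star))

print(logit_to_prob(0.0))            # log-odds of 0 -> probability 0.5
print(round(logit_to_prob(2.0), 3))  # positive log-odds -> probability above 0.5
```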

You need data in all possible combinations of the predictor values, as otherwise the model cannot estimate what the outcome would be. Always run a crosstabs and check for empty cells. With continuous predictors, look for very large standard errors, as they signal this type of problem.

When the sizes of the outcome groups are too different (more than about a 70/30 split), estimation power suffers, as there are not enough cases in each combination to compare.

Understanding ODDS RATIOS, Exp(B)

Odds = Probability / (1 - Probability), where

Probability (a) = (number of 'a' / total)

Exp(B) is an odds ratio: the factor by which the odds of the "event" (versus "no event") are multiplied for a one-unit increase in the predictor.

Therefore we cannot say that the probability of the positive effect is Exp(B); we need to state that the odds of a positive effect are multiplied by Exp(B).

What is the difference between "probability" and "odds"?

In a very simple example, let's consider a bag with 200 balls.

We want to calculate the probability of drawing a yellow ball, given that 40 are yellow and the rest are not.

To calculate this probability I do:

p(yellow) = Yellow / Total = 40 / 200 = 1/5

I can say that the probability is 20% or 1 in 5.

But what are the odds? Well this is a different concept, I want to compare the probability of taking a yellow to the probability of not taking a yellow.

ODDS = P / (1 - P) = (40/200) / (1 - 40/200) = 0,20 / 0,80 = 0,25

So I state that the odds are 0,25, readable as: for every 100 non-yellow balls there are 25 yellow balls.

Summing up:

Probability of a yellow ball 20% - out of the total number of balls I have a 20% probability of getting a yellow one.

Odds of a yellow ball 0,25 - this is for 100 non yellow balls the bag contains 25 yellow.
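The yellow-ball arithmetic above as code:

```python
yellow, total = 40, 200

p = yellow / total      # probability: yellow out of all balls
odds = p / (1 - p)      # odds: yellow relative to non-yellow

print(p)     # 0.2  -> a 20% probability
print(odds)  # 0.25 -> 25 yellow balls for every 100 non-yellow ones
```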

Back to our diabetes example:

Running logistic regression on diabetes explained by age and cholesterol I can make the following interpretation:

Exp (B) Age = 1,045

For every additional year of age, the odds of being diabetic (versus non-diabetic) are multiplied by 1,045. Meaning that the risk of diabetes increases slightly with every year, all the rest constant.

We could also read this as: if the odds started even (100 diabetics for every 100 non-diabetics), one more year of age shifts them to roughly 104,5 diabetics for every 100 non-diabetics, all the rest constant.

To turn this into a probability you could solve: 104,5 / 204,5 ≈ 0,51, meaning that one extra year moves the risk of diabetes only about 1 percentage point above the 50% baseline (remember that 50% is equal probability of the event).
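The same reading as a short sketch (the 1,045 coefficient is taken from the example above; the even-odds starting point is an illustrative assumption):

```python
import math

exp_b = 1.045              # Exp(B) for age, per additional year
b = math.log(exp_b)        # the raw coefficient B on the log-odds scale

# starting from even odds (probability 0.5), one extra year multiplies the odds:
odds_after = 1.0 * exp_b   # roughly 104.5 "diabetic" per 100 "non-diabetic"
p_after = odds_after / (1.0 + odds_after)
print(round(p_after, 3))   # about one point above the 50% baseline
```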