Transcript of Logistic Regression
Statistics for Beginners - 9th session
(Mann-Whitney, Wilcoxon) - compare means for two categories
(Kruskal-Wallis, Friedman) - compare means for more than two categories
(Spearman) - compare one continuous predictor
- compare two or more predictors.
What questions arise when we have a categorical independent variable?
How to explain:
dead / alive; diabetic / non-diabetic; repeated / did not repeat; married / unmarried ...
Assumptions and what can go wrong
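- Linearity: each continuous predictor should be linearly related to the logit of the outcome variable (y*). To check this, create the natural log (ln) of every continuous predictor in the model and run a logistic regression that includes the interaction between each predictor and its own ln. Any significant interaction term means that linearity is not achieved.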
Independence of Errors
- The cases must be independent
- Predictors should not be too highly correlated; to check this we use the tolerance statistic
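As a rough illustration of the tolerance check, here is a minimal sketch assuming only two predictors. In that case tolerance reduces to 1 - r², where r is the Pearson correlation between them. The age and cholesterol values are invented for illustration, and the function names are hypothetical. A common rule of thumb (not from these slides) treats tolerance below 0,1 as a serious problem.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def tolerance_two_predictors(x, y):
    """Tolerance = 1 - R^2; with only two predictors, R^2 equals r^2."""
    return 1 - pearson_r(x, y) ** 2

# Hypothetical values, invented for illustration only.
age = [25, 32, 41, 47, 53, 60, 68]
cholesterol = [180, 210, 190, 230, 205, 250, 220]

tol = tolerance_two_predictors(age, cholesterol)
print(f"tolerance = {tol:.2f}")  # values close to 0 signal highly correlated predictors
```

With more than two predictors, R² would come from regressing each predictor on all the others, which statistical packages report directly.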
Chi Square Test
Compare two categories:
Is there a relation between learning to dance and the type of reward?
Dependent Variable - two categories
Independent variable - any sort of variable
- one categorical predictor
- two or more predictors
Sometimes we want to look at relationships between categorical variables (dependent) and other variables (independent)
So we are starting with the simple case
Compare two categorical variables.
We cannot use the mean of two categorical variables...
on average we cannot be 0,56 alive (or married, or diabetic ...)
Remember that the values attributed to categorical variables are arbitrary. Therefore we should always analyze FREQUENCIES and not means.
our example is: cats.sav
The question is: what is the best reward for teaching cats how to dance?
: Cat learned to dance / cat did not learn to dance
Reward (Food ; Affection)
It tests the null hypothesis - nothing happens
H0: There is no difference in reward system when we teach cats to dance;
H1: There is a difference in rewards
The formula that Chi-Square uses is the comparison between expected values and observed values
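Written out, the Pearson statistic takes the standard form below. The symbols are chosen here for illustration: O_ij is the observed count in row i and column j, E_ij is the count expected if H0 were true, and n is the total number of cases.

```latex
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row total}_i)\,(\text{column total}_j)}{n}
```

The larger the gap between observed and expected counts, the larger the statistic and the stronger the evidence against H0.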
Computing a Chi-square for our cats
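The same calculation can be sketched by hand in a few lines of Python. The counts below are hypothetical and are not the values stored in cats.sav; the function name chi_square is also just an illustration.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under H0: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return chi2, expected

# Hypothetical counts (NOT the values in cats.sav):
# rows = reward (food, affection); columns = (danced, did not dance)
table = [[30, 10],
         [20, 40]]

chi2, expected = chi_square(table)
print(f"chi-square = {chi2:.2f}")
print("smallest expected count:", min(min(row) for row in expected))
```

For a 2 x 2 table there is one degree of freedom, so a value above 3,84 is significant at the 5% level. The smallest expected count is printed because the test assumes expected frequencies of at least 5.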
We get several values in addition to the Pearson Chi-Square:
Fisher's Exact Test
- use when the sample is small
Likelihood Ratio
- an alternative to Pearson that compares observed frequencies with those predicted by the model - we'll see more on this in logistic regression
Continuity Correction (Yates)
- a correction to the Pearson test, used for two categorical variables with two categories each.
1. Independence of data
2. An expected frequency of at least 5 in each category (meaning that if you are testing two categories, 10 observations is the minimum)
Calculating the Effect size
The Symmetric Measures table gives you a grip on the size of the effect, as it is reported on an interval from 0 to 1, with 0 meaning no association and 1 meaning a perfect association.
up to 0,2 - small effect
0,2-0,5 - medium effect
above 0,5 - big effect
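As a sketch of where such a 0-1 value can come from, the snippet below computes Cramér's V, one of the symmetric measures commonly reported for contingency tables (an assumption here, since the slides do not name the measure). The chi-square value and sample size are hypothetical; the labels follow the thresholds listed above.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

def effect_label(v):
    """Thresholds from the text: up to 0.2 small, 0.2-0.5 medium, above 0.5 big."""
    if v <= 0.2:
        return "small"
    if v <= 0.5:
        return "medium"
    return "big"

# Hypothetical inputs: a chi-square of 16.67 from a 2 x 2 table with 100 cases.
v = cramers_v(chi2=16.67, n=100, n_rows=2, n_cols=2)
print(f"Cramér's V = {v:.2f} ({effect_label(v)} effect)")
```

For a 2 x 2 table Cramér's V has the same magnitude as Phi, so either can be read against the thresholds.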
Logistic regression allows us to predict categorical outcomes based on predictor variables.
It tells us which category an individual is likely to belong to, given certain information.
We can predict whether the next person to come in is male or female based on:
laziness; stubbornness; multi-tasking ability; aggressiveness
We can predict the probability of a person having cancer, high blood pressure or diabetes, or even of surviving.
Let's start with an example:
The problem is to assess whether diabetes is linked to excess weight.
In graphic terms, logistic regression tries to fit an S-shaped curve to the probability of occurrence.
The coefficients are read as the rise or reduction in the log odds (the logit) of the event. If you actually want to calculate the probability you have to use the following expression:
p = 1 / (1 + e^(-y*))
Where y* is the forecast value of the logistic regression and p the probability of the original event.
Please note that logistic regression performs poorly when there is no overlap in the data, that is, when the predictor values of the two outcome groups do not overlap at all.
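The link between y* and p can be sketched as follows, using the standard logistic function p = 1 / (1 + e^(-y*)) and its inverse. The function names and the sample y* values are hypothetical.

```python
import math

def logit_to_probability(y_star):
    """Convert a forecast on the logit scale (y*) into a probability p."""
    return 1 / (1 + math.exp(-y_star))

def probability_to_logit(p):
    """Inverse step: the logit is the natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

# Hypothetical forecast values on the logit scale.
for y_star in (-2.0, 0.0, 2.0):
    print(f"y* = {y_star:+.1f} -> p = {logit_to_probability(y_star):.3f}")
```

A y* of zero corresponds to a probability of 50%; negative values fall below it and positive values above it.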
You need data for all possible combinations of the predictor values, as we cannot guess what the outcome would be for a missing combination. Always run a crosstabs and check for empty cells. With continuous predictors, look for unusually large standard errors, as they signal this type of problem.
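A crosstab check for empty cells can be sketched in plain Python. The cases and variable names below are invented for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical cases as (overweight, diabetic) pairs, invented for illustration.
cases = [
    ("yes", "yes"), ("yes", "no"), ("no", "no"),
    ("no", "no"), ("yes", "yes"),
]

counts = Counter(cases)
levels_overweight = sorted({a for a, _ in cases})
levels_diabetic = sorted({b for _, b in cases})

# Every combination of levels should appear at least once.
empty_cells = [
    cell for cell in product(levels_overweight, levels_diabetic)
    if counts[cell] == 0
]

for cell in product(levels_overweight, levels_diabetic):
    print(cell, counts[cell])
print("empty cells:", empty_cells)
```

Any combination listed as empty is one for which the model has no information about the outcome.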
When the outcome groups differ too much in size (one group holding more than about 70% of the cases), estimation suffers, as there are not enough cases in each group to compare.
Understanding ODDS RATIOS EXP(B)
Odds = Probability / (1 - Probability) while
Probability (a) = (number of 'a' / total)
The odds are the ratio of the probability of the "event" to the probability of "no event". EXP(B) is the odds ratio: the factor by which these odds are multiplied when the predictor increases by one unit.
Therefore we cannot say that exp(B) is the probability of the positive effect; we need to state it in terms of the odds of a positive effect.
What is the difference between "probability" and "odds"?
In a very simple example, let's consider a bag with 200 balls.
We want to calculate the probability of taking a yellow ball, given that 40 are yellow and the rest are not.
To calculate this probability I do:
p(yellow) = Yellow / Total = 40 / 200 = 1/5
I can say that the probability is 20% or 1 in 5.
But what are the odds? This is a different concept: I want to compare the probability of taking a yellow ball to the probability of not taking one.
ODDS = P / (1 - P) = (40/200) / (1 - (40/200)) = 0,20 / 0,80 = 0,25
So I state that the odds are 0,25; this reads as: for every 100 non-yellow balls there are 25 yellow balls.
Probability of a yellow ball 20% - out of the total number of balls I have a 20% probability of getting a yellow one.
Odds of a yellow ball 0,25 - for every 100 non-yellow balls the bag contains 25 yellow ones.
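The bag example can be reproduced in a few lines. The counts (40 yellow out of 200) come from the example above; the function names are hypothetical.

```python
def probability(count, total):
    """Share of the favourable outcomes in the total."""
    return count / total

def odds(p):
    """Probability of the event divided by the probability of no event."""
    return p / (1 - p)

def odds_to_probability(o):
    """Inverse conversion: p = odds / (1 + odds)."""
    return o / (1 + o)

p_yellow = probability(40, 200)
odds_yellow = odds(p_yellow)

print(f"probability = {p_yellow:.2f}")  # 0.20
print(f"odds        = {odds_yellow:.2f}")  # 0.25
print(f"back to p   = {odds_to_probability(odds_yellow):.2f}")  # 0.20
```

The two numbers describe the same bag: probability counts yellow balls against all balls, while odds count them against the non-yellow balls only.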
Back to our diabetes example:
Running logistic regression on diabetes explained by age and cholesterol I can make the following interpretation:
Exp (B) Age = 1,045
For every additional year of age, the odds of being diabetic (probability of diabetic / probability of non-diabetic) are multiplied by 1,045. This means that the risk of diabetes increases slightly every year, all the rest constant.
We could also read this as: if at a given age there are 100 diabetic people for every 100 non-diabetic people (odds of 1), then one year later we expect about 104 diabetic people for every 100 non-diabetic people.
To turn this into a probability, starting from those even odds, you could solve: 104/204 = 0,51, meaning that one more year of age moves the probability of diabetes from 50% to about 51% (remember that 50% is equal probability of event and no event). The size of this shift depends on the starting probability.
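Because Exp(B) is an odds ratio, it acts on the odds rather than on the probability directly. The sketch below applies Exp(B) = 1,045 from the example to two starting probabilities; those baselines and the function name are hypothetical.

```python
def apply_odds_ratio(p_before, odds_ratio, units=1):
    """Probability after increasing the predictor by `units`, given an odds ratio."""
    odds_before = p_before / (1 - p_before)
    odds_after = odds_before * odds_ratio ** units
    return odds_after / (1 + odds_after)

EXP_B_AGE = 1.045  # Exp(B) for age, taken from the example above

# Hypothetical baseline probabilities of diabetes, for illustration only.
for p_before in (0.10, 0.50):
    p_after = apply_odds_ratio(p_before, EXP_B_AGE)
    print(f"baseline {p_before:.2f} -> one year later {p_after:.3f}")
```

The same odds ratio produces a different change in probability depending on where you start, which is why Exp(B) should be reported as a change in odds.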