**Correlation**

Last week we discussed intact group designs,

which investigate the relationship between

a distinguishing construct of interest (which

categorically divides a population) and a

dependent variable, which can be either

categorical or continuous.

For example:

Do parents of children with autism

exhibit more stress than parents of children

with Down syndrome?

Distinguishing Construct: Diagnosis

Dependent Variable: Valid and Reliable Stress Index

Does this research involve experimental manipulation?

Because participants existed in their groups prior to our investigation (rather than being randomly assigned), we often can't establish temporal precedence of the "IV", and we can't control for the influence of third variables, SO WE CAN

NEVER MAKE CAUSAL INFERENCES WITH THIS DESIGN.

Similarly, we can investigate

the relationship between two

continuous variables using a

correlational design.

For this design we need:

a population of interest

at least 2 continuous variables of interest

For example:

Are math age equivalency scores correlated with reading age equivalency scores for children in middle school?

What is the population?

What are the variables?

Categorical or continuous?

Can we establish a relationship between variable x and variable y?

Can we establish temporal precedence of variable x?

Can we rule out other possible explanations for the association?

Correlational designs are plagued by the same limitations as intact-group designs. As the saying

goes, "correlation is not causation." Of course, we

hear this all the time, but we repeatedly treat correlational data as if it DOES show causality.

Deb Roy got 24/7 video data on

his son's language development in the house.

Watch 4:20 - 7:54

What impression do you come away with?

What is "causing" his son to acquire each word?

Despite the limitations, there is an important advantage of correlational designs over intact-group. What do we gain from keeping our "independent variable" continuous?

[Figure: a distribution of TOWRE Scores dichotomized into "Bad Reader" and "Good Reader"]

When we dichotomize a continuous variable, even if our dichotomization is based on theory, we lose information. What is the true distinction between a good reader and a bad reader? Where is the line? If we use a correlational design, we never have to draw that line.
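The cost of drawing that line can be sketched with simulated data (the scores and the 0.5 effect size below are hypothetical, chosen only for illustration): correlating an outcome with a median-split "good/bad reader" variable yields a smaller coefficient than correlating it with the original continuous scores.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
towre = rng.normal(100, 15, n)                 # hypothetical continuous reading scores
outcome = 0.5 * towre + rng.normal(0, 10, n)   # outcome linearly related to reading

# Continuous predictor keeps the full information
r_continuous = np.corrcoef(towre, outcome)[0, 1]

# Dichotomizing at the median ("bad" vs "good" reader) throws information away
dichotomized = (towre > np.median(towre)).astype(float)
r_dichotomized = np.corrcoef(dichotomized, outcome)[0, 1]
```

With a median split of a roughly normal variable, the dichotomized (point-biserial) correlation is reliably smaller in magnitude than the continuous one.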

We can use correlational

designs to examine

relationships between

co-occurring variables

(concurrent) or those with

temporal precedence

(longitudinal)

These relationships

are of great importance to us

in education, because we often

examine them to establish the validity

of our assessments.

Our measure of reading ability should

correlate with other validated measures

of reading, and it should correlate with

component skills of reading, such as

phonological awareness. This verifies the

idea that we are measuring what we say

we are measuring.

It's important to choose a population that is homogeneous, and to formulate falsifiable

hypotheses prior to examining the relationships

between two variables.

Here is a plot of the correlation between physical strength and math ability for middle school students. We can think of an explanation for why this relationship exists. Testosterone levels are

associated with both strength and math comprehension skills, so

perhaps that is the underlying cause of this relationship.

But here is the same plot, with sixth graders mapped in red and eighth graders mapped in blue. What's the likely third variable causing this relationship? Age. Because our population wasn't homogeneous, we saw a relationship that didn't exist. Because we didn't have a falsifiable hypothesis based on theory, once we saw that relationship, it was easy to invent an explanation.

We can also establish the validity of a measure

by showing that it predicts something of great importance to us.

Autism is usually diagnosed between ages

2-4. We know that early intervention can positively affect outcomes, so early identification of treatment needs is of great interest to us. If our measure at 6 months is highly correlated with severity of autism at age 4, it would have predictive validity.

How are we analyzing the

relationship between our

variables of interest?


I can plot univariate data on a line to get a sense of the spread, or I can combine them to get...

Bivariate data! Now we see a visual representation of the relationship between both of these variables.

The slope of the line indicates the direction of the relationship, and the distance of the data points from the line indicates the strength of the relationship.

r(x,y) = cov(x,y) / (s(x)·s(y)) = .112

We can calculate the

correlation, which tells

us the magnitude and

direction of the relationship.
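As a sketch, the formula above can be computed directly with NumPy. The scores here are made-up illustrative numbers, not data from the lecture; note that the sample (n − 1) estimator is used for both the covariance and the standard deviations so the ratio matches Pearson's r.

```python
import numpy as np

# Hypothetical scores on two continuous measures (illustrative only)
x = np.array([88.0, 92, 75, 60, 81, 70, 95, 66])
y = np.array([85.0, 90, 70, 65, 80, 72, 93, 60])

# r = cov(x, y) / (s_x * s_y), with sample (n - 1) estimates throughout
r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
```

The result is identical to what `np.corrcoef(x, y)` returns, since that function implements the same ratio.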

If we re-examine our earlier data, we see

that our line is tilted up slightly, so the

relationship is positive, but the average

distance of our data points from the line of

best fit is pretty large, and so the magnitude

of our correlation is not very big.

When the data lines up perfectly,

then we have a perfect correlation.

This means that if we know an

individual's score on a single

variable, then we can know her

exact score on the correlated

variable. It gives us 100% prediction

power. (In reality, this never

happens).

Reading Scores

1. Bobby

2. Billy

3. Jenny

4. Jesse

5. Amber

6. Molly

7. George

8. Esther

9. Johnny

10. Jason

11. Julie

12. Lauren

13. Jimmy

14. Carrie

15. Samantha

16. Jonah

17. Heartley

18. Emma

19. Matt

20. Brent

21. Drew

22. Sarah

23. Rosey

24. Jane

25. Michael

Math Scores

1. Bobby

2. Billy

3. Jenny

4. Jesse

5. Amber

6. Molly

7. George

8. Esther

9. Johnny

10. Jason

11. Julie

12. Lauren

13. Jimmy

14. Carrie

15. Samantha

16. Jonah

17. Heartley

18. Emma

19. Matt

20. Brent

21. Drew

22. Sarah

23. Rosey

24. Jane

25. Michael

Again, this is the same as getting a 100% stable ranking. This is why reliability in group studies is often measured as an intraclass correlation (ICC). It's the extent

to which both raters ranked the participants in a similar order.
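The "same ordering" idea can be illustrated with a rank-based index. This is a hedged sketch using a Spearman correlation (the Pearson correlation of the ranks) on five hypothetical students, rather than a full ICC computation; when two measures order the students identically, the rank correlation is exactly 1.

```python
import numpy as np

# Hypothetical reading and math scores for five students (same ordering)
reading = np.array([98, 91, 85, 72, 60])
math = np.array([95, 93, 80, 75, 65])

def ranks(a):
    """Return the rank position of each score (0 = lowest)."""
    order = a.argsort()
    r = np.empty_like(order)
    r[order] = np.arange(len(a))
    return r

# Spearman correlation = Pearson correlation of the ranks
rho = np.corrcoef(ranks(reading), ranks(math))[0, 1]
```

Because both lists are in the same order, `rho` comes out to exactly 1.0; shuffling one student's rank would pull it below 1.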

We can square the correlation (r)

to get r^2, an index of the proportion of

variance explained.

Let's say Michael tests pretty poorly

on a standardized measure of math.

(Univariate Descriptive)

Good at smiling.

Bad at math.

At this point, we don't know why

Michael has done more poorly than expected

(the mean). Perhaps there is something

influencing his score. Perhaps there are multiple things. Perhaps we have some measurement error.

But what if we find

that there is a pretty

strong negative correlation

between the number of

days premature at birth

and standardized math

scores. Michael was 1

month premature. When

we graph the bivariate

data, Michael has a new

distance from a new

expected value.

In this scenario, Michael has two estimates of variance: his distance from the mean, and his distance from the regression line. We can examine these estimates of variance for each person in the study and form a ratio of unexplained variance to total variance. When we subtract that ratio from 1, we get the proportion of variance in y that is accounted for by x.
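That ratio can be computed directly. This sketch uses hypothetical days-premature and math-score numbers (not real data); for a simple least-squares line, 1 minus the ratio of residual to total variance equals r squared.

```python
import numpy as np

x = np.array([0.0, 5, 10, 20, 30, 45, 60])     # days premature (hypothetical)
y = np.array([95.0, 92, 90, 85, 80, 70, 62])   # standardized math score (hypothetical)

# Fit the line of best fit: y_hat = a + b*x
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)       # total variance in y (about the mean)
ss_unexplained = np.sum((y - y_hat) ** 2)    # residual variance about the line

# Proportion of variance in y accounted for by x
r_squared = 1 - ss_unexplained / ss_total
```

For simple linear regression this quantity is exactly the squared Pearson correlation between x and y.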

We want to explain

all of our variance to

show that none of it is

resultant from random

error. How can we do that?

Multiple regression

allows us to examine the correlation between a criterion variable and two or more predictor variables. We can explain the relationship with the formula for a plane. Because it is multidimensional, it will have multiple slopes.

When we use multiple regression to analyze the relationship between variables, we can calculate beta weights (which help us understand the magnitude and direction of the effect for each variable) and an R value, which we can square, to get the amount of variance in y explained by our predictors.

an example

Research question: Does high school GPA, total SAT score, or quality of letters of recommendation significantly predict college GPA?

R squared = .425

So, do these variables predict college GPA?
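Here is a minimal sketch of such an analysis with simulated data. All the numbers (sample size, effect sizes, noise level) are invented for illustration; the point is the mechanics: build a design matrix with an intercept column, solve by least squares, and compute R squared from the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical predictors: HS GPA, total SAT, letter-of-recommendation quality
hs_gpa = rng.normal(3.0, 0.4, n)
sat = rng.normal(1100, 150, n)
letters = rng.normal(3.5, 0.8, n)
# Simulated criterion: college GPA depends on all three, plus noise
college_gpa = 0.5 + 0.6 * hs_gpa + 0.0005 * sat + 0.1 * letters + rng.normal(0, 0.3, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), hs_gpa, sat, letters])
coefs, *_ = np.linalg.lstsq(X, college_gpa, rcond=None)

# R^2: proportion of variance in the criterion explained by the predictors
y_hat = X @ coefs
r2 = 1 - np.sum((college_gpa - y_hat) ** 2) / np.sum((college_gpa - college_gpa.mean()) ** 2)
```

The unstandardized slopes in `coefs` describe each predictor's effect in its own units; standardized beta weights would require rescaling each variable first.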

Assumption of Linearity

Regression assumes there is a linear relationship

between the predictors and the predicted.

Nonlinear relationships will change your results.

Confidence Intervals

Recall that in order to estimate confidence intervals

around the mean of a sample, or a mean difference between two samples, we needed a distribution of means from repeated samples of a population.

In regression, we want to estimate confidence

intervals around our slopes. Consequently, we

need a distribution of slopes.

Here is data on two variables of interest from the National Longitudinal Study of Freshman (NLSF; Singer, 2006). Let's call this our population, and the red line our population regression line.

We can repeatedly sample this population and get a slope each time.

Here's one.

Here's another.

And another.

Here's a bunch more

Here they're graphed

together: estimated regression

lines for 10, 100, and 1000

samples.

and here it's plotted as a distribution of slopes.
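The repeated-sampling procedure can be sketched in a few lines. This simulation stands in for the NLSF data (the population slope of 0.5 and all sample sizes are assumptions for illustration): treat a large simulated dataset as the population, draw many samples, and collect the estimated slope from each.

```python
import numpy as np

rng = np.random.default_rng(1)

# A simulated "population" with a true slope of 0.5
N = 10_000
pop_x = rng.normal(0, 1, N)
pop_y = 0.5 * pop_x + rng.normal(0, 1, N)

# Repeatedly sample the population and record the estimated slope each time
slopes = []
for _ in range(1000):
    idx = rng.choice(N, size=100, replace=False)
    b, _a = np.polyfit(pop_x[idx], pop_y[idx], 1)
    slopes.append(b)
slopes = np.array(slopes)

# The 1000 slopes form a sampling distribution centered near the population slope
```

Plotting `slopes` as a histogram gives the distribution of slopes shown on the slide.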

In other words, if we ranked

participants by math score,

and again by reading score,

would the rankings be similar?

If the measures are perfectly

correlated, then our rank order

will be the same.

Math Scores

1. Bobby

2. Billy

3. Jenny

4. Jesse

5. Amber

6. Molly

7. George

8. Esther

9. Johnny

10. Jason

11. Julie

12. Lauren

13. Jimmy

14. Carrie

15. Samantha

16. Jonah

17. Heartley

18. Emma

19. Matt

20. Brent

21. Drew

22. Sarah

23. Rosey

24. Jane

25. Michael

This individual has a

score of 19 on the x

measure and 50 on the

y measure.

These distances outlined in pink, between each data

point and the line of best fit (expected value) are the

residuals. Residuals, like deviations from the mean, can be averaged to provide a sample estimate of error.

We always estimate our error with some measure of

the extent to which our values have strayed from the

expected value (variance).

Total Variance in y

Explained Variance

The idea is that some of our variance, which could have been due to random error, is actually not random, and explained by our scores on another variable.

Reading Scores

1. Bobby

2. Billy

3. Jenny

4. Jesse

5. Amber

6. Molly

7. George

8. Esther

9. Johnny

10. Jason

11. Julie

12. Lauren

13. Jimmy

14. Carrie

15. Samantha

16. Jonah

17. Heartley

18. Emma

19. Matt

20. Brent

21. Drew

22. Sarah

23. Rosey

24. Jane

25. Michael

ŷ = b₀ + b₁x₁ + b₂x₂

(predicted variable = intercept + slope [effect of] predictor 1 + slope [effect of] predictor 2)

In this case, despite a clear

relationship between the variables,

your fit will be poor.

Other things to remember:

Beta weights should be interpreted conservatively; their magnitude depends on the order in which you enter the variables. The more variables you add to a regression equation, the more variance you will explain in your sample, but if this isn't based on theory, it's not likely to replicate.

so now we can calculate our confidence interval as

slope estimate ± (critical value of the distribution × standard error of the slope)
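A minimal sketch of that formula, assuming hypothetical x and y values and a 95% interval: the standard error of the slope comes from the residual mean square and the spread of x, and the critical value is from a t distribution with n − 2 degrees of freedom (≈ 2.306 for df = 8).

```python
import numpy as np

# Hypothetical bivariate data (n = 10)
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1, 10.9])
n = len(x)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

# Standard error of the slope: sqrt(MSE / sum of squared deviations of x)
mse = np.sum(residuals ** 2) / (n - 2)
se_b = np.sqrt(mse / np.sum((x - x.mean()) ** 2))

# 95% CI: slope ± critical t (df = n - 2 = 8, t ≈ 2.306) × SE of slope
t_crit = 2.306
ci = (b - t_crit * se_b, b + t_crit * se_b)
```

If the interval excludes zero, the slope is significantly different from zero at the .05 level.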

What do we do to deal with the

effects of nuisance variables in

correlational studies?


For example, what if we want to

examine the effects of nutrition

on academic achievement?

We could use a measure of the nutritional environment

such as the Nutrition Environment Measures Survey

And we could measure academic achievement with

standardized end of the year test scores.

but we know that the nutritional environment

is also correlated with SES, which can also influence

academic achievement

How do we control

for the effect of this

nuisance variable?

Part and partial correlations help

us remove the effects of variables

that are correlated with our

constructs of interest.

We calculate them by examining the shared variance. Remember our different residuals?

We can visualize the residuals (variance)

of all three variables to look like this. Some

variance is associated with only one variable.

Other variance is shared between two

variables. Some variance is shared by

all three variables.

The part correlation

helps us express a

more conservative

estimate of the

relationship between

nutrition and

academic achievement.

Part correlation for nutrition is

b / (a + b + c + e)

See how we took the shared residuals

between Nutrition and SES out of our

correlation metric?

The partial correlation is even more conservative.

Partial correlation for nutrition is b / (b + e)

We're removing ALL of the residuals shared by SES
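Both indices can be computed from the three pairwise correlations. This is a sketch with simulated data (the SES/nutrition/achievement effect sizes are invented): the part (semipartial) correlation removes SES from nutrition only, while the partial correlation removes SES from both nutrition and achievement.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
ses = rng.normal(0, 1, n)                     # nuisance variable
nutrition = 0.6 * ses + rng.normal(0, 1, n)   # correlated with SES
achievement = 0.4 * nutrition + 0.5 * ses + rng.normal(0, 1, n)

def corr(u, v):
    return np.corrcoef(u, v)[0, 1]

r_ya = corr(achievement, nutrition)   # achievement–nutrition
r_yz = corr(achievement, ses)         # achievement–SES
r_az = corr(nutrition, ses)           # nutrition–SES

# Part (semipartial) correlation: SES partialed out of nutrition only
part = (r_ya - r_yz * r_az) / np.sqrt(1 - r_az ** 2)

# Partial correlation: SES partialed out of both variables
partial = (r_ya - r_yz * r_az) / np.sqrt((1 - r_az ** 2) * (1 - r_yz ** 2))
```

Both estimates are smaller than the raw nutrition–achievement correlation here, because part of that raw correlation was carried by SES.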

Let's process these concepts

by designing a correlational

study together.

*image adapted from slides created by James Steiger


*image adapted from slides created by Jennifer Gilbert
