**Statistics for Social Research**

Professor McCabe

- t-tests/difference in means
- ANOVA (analysis of variance)
- relationships between nominal variables (e.g., chi-squared)
- correlation and linear regression
- multiple regression
- thinking about causality
- final review

t-tests/difference in means

ANOVA (analysis of variance)

review: why we never "accept" the null hypothesis

Group A: You're testing the null hypothesis that the mean GPA of students at Georgetown is equal to 3.00. You draw a sample of 155 students. Your sample mean is 3.09 and your standard deviation is 0.9. Set alpha equal to 0.05. What is your critical value? Calculate your test statistic. Make a rejection decision.

Group B: You're testing the null hypothesis that the mean GPA of students at Georgetown is equal to 3.00. You draw a sample of 425 students. Your sample mean is 3.09 and your standard deviation is 0.9. Set alpha equal to 0.05. What is your critical value? Calculate your test statistic. Make a rejection decision.

Group A: Xbar = 3.09, sd = 0.9, n = 155; se = 0.073; Z = 1.245. Z < CV (1.96): fail to reject.

Group B: Xbar = 3.09, sd = 0.9, n = 425; se = 0.044; Z = 2.062. Z > CV (1.96): reject.
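The two group exercises can be checked with a short sketch; the 1.96 critical value assumes a two-tail test at alpha = 0.05:

```python
import math

def one_sample_z(xbar, mu0, sd, n):
    """Z statistic for a one-sample test of the mean."""
    se = sd / math.sqrt(n)      # standard error of the mean
    return (xbar - mu0) / se

# Group A: n = 155 -> Z falls below the 1.96 critical value, fail to reject
z_a = one_sample_z(3.09, 3.00, 0.9, 155)

# Group B: same sample mean and sd, but n = 425 -> Z exceeds 1.96, reject
z_b = one_sample_z(3.09, 3.00, 0.9, 425)
```

The only thing that changes between the groups is the sample size: a larger n shrinks the standard error, so the same 0.09 difference produces a larger Z.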

Last week: a one-sample hypothesis test, testing whether a mean was different from a target value.

This week: a two-sample test for the difference in means, testing whether the mean of one group differs from the mean of another group; ANOVA to test for mean differences across multiple groups.

Concept: Bivariate Relationship

A bivariate relationship is simply the relationship between two (bi) variables (variate).

We are interested in thinking about how levels of one variable (the dependent variable) change across levels of another variable (the independent variable)

This week: How levels of a continuous variable change across levels of a discrete (dichotomous, nominal, ordinal) variable.

Next week: How levels of a discrete variable change across levels of another discrete variable.

The following week: The relationship - or correlation - between two continuous variables.

This week: Does the number of hours spent studying (continuous variable) differ between male and female students (discrete variable)?

Next week: Does voting in a GUSA election (dichotomous - yes/no) differ between freshmen, sophomores, juniors and seniors (ordinal)?

The following week: Is the number of basketball games won by the Georgetown Hoyas each year (continuous) associated with the average GPA of the Georgetown student body (continuous)?

This week, we will be looking at continuous variables (e.g., height, GPA, test scores, rates, etc.) across levels of discrete variables (e.g., race, sex, year in school, dichotomous, etc.)

Concept: Two Sample Difference in Means

Rules for a Two Sample Difference in Means

1. The variable you're interested in is continuous.

2. Your groups are independent. In other words, they do not include the same people/subjects.

3. There is equal variance in the populations. We often assume this to be true, and check the variance (or standard deviation) in the samples for confirmation.

Concept: Standard Error for the Difference in Means

The standard error for the difference in means pools the variance from both samples, weighting the variance for each sample by its degrees of freedom:

se = sqrt[ ((n1 - 1)s1^2 + (n2 - 1)s2^2) / (n1 + n2 - 2) ] * sqrt[ (1/n1) + (1/n2) ]

where s1^2 is the variance for the first sample and n1 - 1 is the degrees of freedom for the first sample (and likewise for the second sample). The combined degrees of freedom is n1 + n2 - 2.

Concept: Steps for a t-test

Null Hypothesis: the mean from the first population equals the mean from the second population.

For the difference in means where equal variance is assumed, the test statistic is called the t-test.

t-test for the difference in means:

t = (Xbar1 - Xbar2) / se

where Xbar1 is the mean of sample 1, Xbar2 is the mean of sample 2, and se is the standard error of the difference in means.

1. State the Null Hypothesis.

2. State the Research Hypothesis.

3. Get your sample statistics (Xbar). Calculate the standard error.

4. Calculate your test statistic.

5. Decide on a one-tail or two-tail test.

6. Decide on your level of alpha. Determine the critical value.

7. Make a rejection decision.

8. Interpret your results.

Practice Problem

In a random sample of American adults (n=641), researchers wanted to know whether men and women hold different attitudes towards gun control. Survey respondents were asked ten questions about gun control. The answer to each question was coded "1" if the respondent supported stricter gun control laws and "0" if the respondent did not support stricter gun control laws. After aggregating these questions, each respondent ended up with a score between 0 and 10.

The researchers found that in the sub-sample of men (n=324), the mean score was 6.2 and the standard deviation was 1.3. In the sub-sample of women (n=317), the mean score was 6.5 and the standard deviation was 1.4. Can the researchers conclude that there is a gender difference in attitudes toward gun control?

Xbar1 = 6.2, Xbar2 = 6.5, se = 0.10668

t = -2.812, cv = 1.96; |t| > cv: reject the null
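A sketch of the calculation, assuming the standard pooled-variance formula (sample variances weighted by their degrees of freedom):

```python
import math

def pooled_t(xbar1, s1, n1, xbar2, s2, n2):
    """t statistic for a two-sample difference in means, equal variance assumed."""
    # Pooled variance: each sample's variance weighted by its degrees of freedom
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)
    return (xbar1 - xbar2) / se, se

# Men: mean 6.2, sd 1.3, n 324; Women: mean 6.5, sd 1.4, n 317
t, se = pooled_t(6.2, 1.3, 324, 6.5, 1.4, 317)
```

Since |t| = 2.81 exceeds the 1.96 critical value, the researchers reject the null of no gender difference.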

Concept: Matched Pair Sampling

In our t-test thus far, we have required independent samples (without overlap between the two samples). Another kind of t-test to test for mean differences involves "matched-pair" sampling. We can think of this as pre/post or before/after sampling on the same group of people.

We run an intervention in freshman dorms to teach students about diversity. I take a sample of freshman and give them a pre-test (before the diversity training) and a post-test (after the diversity training) to determine the effect of the training.

Let D be the difference between matched-pair scores (e.g., post-test minus pre-test) and Dbar the mean of the differences between the matched-pair scores.

Standard deviation of the differences: sD = sqrt[ sum of (D - Dbar)^2 / (n - 1) ]

Standard error: se = sD / sqrt(n)

Test statistic for matched-pair scores: t = Dbar / se
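A sketch of the matched-pair test with hypothetical pre/post scores (the data below are invented for illustration, not from the slides):

```python
import math

def matched_pair_t(pre, post):
    """t statistic for matched-pair (pre/post) scores."""
    n = len(pre)
    diffs = [b - a for a, b in zip(pre, post)]   # D for each pair
    dbar = sum(diffs) / n                        # mean of the differences
    # standard deviation of the differences
    s_d = math.sqrt(sum((d - dbar) ** 2 for d in diffs) / (n - 1))
    se = s_d / math.sqrt(n)                      # standard error
    return dbar / se

# Hypothetical pre/post diversity-training scores for five students
pre = [60, 62, 58, 65, 63]
post = [66, 65, 62, 68, 64]
t = matched_pair_t(pre, post)    # df = n - 1 = 4
```

With df = 4 and alpha = 0.05 (two-tail critical value 2.776), a t around 4.19 would lead us to reject the null of no change.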

Researchers would like to measure whether liberals and conservatives report different levels of support for health care reform in the United States. On a scale of 0-100, with 100 being strongly support, they ask a sample of 60 people how much they support health care reform.

In their sample of 25 liberals, they find a mean score of 60 and a standard deviation of 12. In their sample of 35 conservatives, they find a mean score of 49 and a standard deviation of 14.

Xbar1 = 60

Xbar2 = 49

se = 3.52

t = 3.13

df = 58

If alpha = 0.05, cv = 2.021 (note: t-distribution because n < 121)

t>cv

reject null

Concept: ANOVA

We used a t-test to compare the difference in means between two groups (e.g., male/female, students/professors, college graduates/non-graduates). But what happens when we want to compare the means between three or more groups (e.g., freshmen/sophomore/junior/senior; Protestants/Catholics/Jews/Others). In those cases, we use an F-ratio calculated through an analysis of variance.

Considered as an extension of a t-test, we can write out the null hypothesis as follows: the mean of group 1 equals the mean of group 2 equals ... the mean of group k (all population means are equal).

In principle, the ANOVA examines two types of variation. First, we are looking at within-group variation: within each group (e.g., Republicans, Democrats and Independents), how much variation is there around the group mean? Second, we are looking at between-group variation: how much does the mean score for each group (e.g., Republicans, Democrats and Independents) vary?

ANOVA compares the amount of variation between categories (e.g., between R, D and I) with the amount of variation within categories (e.g., among R, D and I).

The greater the differences between categories (means) relative to the differences within categories (standard deviations), the more likely we are to reject the null hypothesis.

If the mean score does, in fact, vary across categories, we would expect the sample means between categories to differ substantially, but the dispersion within categories to be relatively small.

Does support for capital punishment (measured on a scale of 1-10) vary across religions?

Is Hoya Pride (on a scale of 1-10) different across years in school?

Does average BMI change across regions of the country?

Is median income different in cities, suburbs and rural areas?

Do Democrats, Independents and Republicans vary in their score on a political ideology test?

Concept: Sum of Squares Total (SST)

Concept: Sum of Squares Between (SSB)

Concept: Sum of Squares Within (SSW)

SST = the total variation of scores. The SST measures the amount of variation in the scores, relative to the Grand Mean (or the mean of the total sample).

SSW = the total variation within categories. The sum of squares within measures the amount of variation within each of the categories (or how far the individual scores fall from the group mean).

SSB = the total variation between categories. The sum of squares between indicates how much variation there is between the mean of each category.

Concept: Degrees of Freedom (ANOVA)

dfw = degrees of freedom associated with SSW. Take the total number of observations and subtract the number of categories: dfw = N - k.

dfb = degrees of freedom associated with SSB. Take the total number of categories and subtract 1: dfb = k - 1.

Concept: Mean Square Estimates (MSE)

The Mean Square Estimates (MSE) are estimates of the population variance, and are calculated by dividing the sum of squares by the degrees of freedom.

Mean square within = SSW / dfw

Mean square between = SSB / dfb

Concept: F ratio

The F ratio is your test statistic for an ANOVA. It compares the amount of variation between categories to the amount of variation within categories: F = mean square between / mean square within.

(It is the equivalent to your t-test from an analysis of the mean difference between two categories.)

As with our t-tests, the higher the F-ratio, the more likely we are to reject the null hypothesis. In other words, the more variation there is between categories relative to the amount of variation there is within categories, the more likely we are to reject the null hypothesis that the mean score for each of the categories is the same.

Concept: Steps to

Running an ANOVA

1. State your assumptions.

- Independent, random samples

- Continuous outcome

- Populations are normally distributed

- Population variances are equal across groups

2. State your null hypothesis

3. Determine your degrees of freedom.

Decide on alpha. Look at the F-distribution.

Determine your critical value.

4. Calculate your test statistic (F-ratio) using SSW, SSB, dfw and dfb.

5. Compare your test statistic with the critical value and make a rejection decision.

6. Interpret your decision.
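The steps above can be sketched end-to-end; the three groups below are invented for illustration:

```python
def anova_f(groups):
    """F ratio from SSB and SSW for a list of groups (each a list of scores)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]

    # SSB: squared distance of each group mean from the grand mean,
    # weighted by group size
    ssb = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, group_means))
    # SSW: squared distance of each score from its own group mean
    ssw = sum((x - m) ** 2
              for g, m in zip(groups, group_means) for x in g)

    dfb = len(groups) - 1                   # k - 1
    dfw = len(all_scores) - len(groups)     # N - k
    return (ssb / dfb) / (ssw / dfw)        # F = MSB / MSW

# Hypothetical scores for three groups
f = anova_f([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
```

The obtained F is then compared to the critical value from the F-distribution at (dfb, dfw) degrees of freedom.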

Review: One sample, Two samples, Three (or more) samples ...

1: One sample, compared to a hypothesized mean.

t-test (small sample) or Z-scores.

2: Two groups, comparison of means.

t-test (small samples) or Z-scores.

3: Three or more groups, comparison of means.

F-ratio

1: Is the average GPA for Georgetown students equal to 3.50?

2: Is the average GPA for Georgetown females greater than (or different from) the average GPA for Georgetown males?

3: Does the average GPA differ between Freshmen, Sophomores, Juniors and Seniors?

1: Is the mean score on a religious tolerance exam (scores: 1-10) greater than 5?

2: Are people who go to church regularly more tolerant than those who don't regularly go to church?

3: Does the level of religious tolerance vary by the frequency of church going (e.g., at least once a week, at least once a month, at least once a year, never)?

chi-square (to test for relationships between nominal variables)

Concept: Discrete Variables (refresher)

Discrete, or nominal, variables are those measures that fit into categories.

- Religious groups

- Color of the car you drive

- Whether you voted

- Region of the country where your parents live

- Current dorm

Concept: Independence

A chi-square test is used to test the association between nominal variables. Importantly, it is a non-parametric test, meaning that it makes no assumptions about the distribution of the variables (e.g., normally distributed, etc.)

Concept: Bi-variate Tables (or Cross-Tabs)

Before we discuss the chi-square test, we need to consider the construction of a cross-tab. A cross-tab (or cross-tabulation) is simply a bi-variate table showing the relationship between two discrete variables in your data. Bi-variate simply refers to two (bi-) variables (variate).

Source: General Social Survey

The parts of a cross-tab: rows, columns, row marginals, and column marginals.

Two variables are said to be independent if the classification of an observation into the category of one variable has no effect on the probability (or likelihood) that the observation will fall into a category of another variable. In other words, knowing something about where an individual falls on one variable (e.g., hair color) tells us nothing about where they are likely to fall on another variable (e.g., did you vote).

Examples of variables that we might imagine to be independent.

- Gender (M/F) and Metropolitan Status (e.g., urban, suburban, rural)
- Gun ownership (Yes/No) and Religious Denomination (e.g., Catholic, Protestant, etc.)
- Majority religion in a country and landlocked status (Yes/No)

Examples of variables that we might imagine not to be statistically independent.

- Gender and Romney/Obama
- Age group and voting
- Race and Ward of DC where you live
- Whether you were in the top quartile of your high school class and whether you are in the top quartile of your college class
- Religion and number of children

Concept: Chi Square

The chi square test is a test of joint occurrences.

We want to know if the categorization of an observation on one variable (e.g., gender) is independent of the categorization of that observation on another variable (e.g., political ideology).

Expected Frequency: the cell frequencies we would expect to find on account of random chance. This is the cell frequency we would expect to find if the variables were fully independent. The expected frequency for a cell is (row marginal * column marginal) / N.

Observed Frequency: the actual frequency observed in the bi-variate table.

Test statistic (Chi Square Obtained): chi^2 = sum of (fo - fe)^2 / fe across all cells, where fo is the observed frequency and fe is the expected frequency.

Steps for the

Chi Square Test

1. Test Assumptions:

- One population from a random sample
- The level of measurement is nominal/ordinal
- Expected frequency of each cell >= 5

2. State the null hypothesis (that the variables are independent).

3. Select your level of alpha. Note your degrees of freedom. Determine your critical value.

Degrees of Freedom:

df = (r-1)(c-1)

4. Compute the Chi Square test statistic, using both the expected frequency and the observed frequency.

5. Make a rejection decision.

6. Interpret your results.

review: confidence intervals, one-sample t-tests, one-tail vs. two-tail tests, Type I vs. Type II errors, critical values.

#1: Do women at Georgetown have higher GPAs than men at Georgetown?

#2: Do children raised in heterosexual, two-parent households have better educational outcomes than children raised in same-sex, two-parent households?

#3: Do poor children watch more hours of TV each week than middle-class children?

Framework A: Two separate groups, one point in time.

Framework B: One group of people, two points in time (i.e., before and after an intervention); matched-pair sampling.

#1: Does watching a set of campaign ads change the amount that voters support President Obama?

#2: Do diversity programs increase tolerance and acceptance among high school seniors?

Rather than testing whether a sample mean is equal to a target number, we are testing whether the mean of one group (e.g., men) is equal to the mean of another group (e.g., women).

What is the null hypothesis for a test of the difference in means (t-test)?

The formula for the standard error of the difference in means looks more complicated than the formula for the standard error with a single sample. However, as you'll see, it requires the same set of inputs - the sample size (n) and the standard deviation for both samples.

The steps for conducting a t-test for the difference in means between two groups are basically the same as for testing with a single sample (last week).

Stata for two-sample Difference in Means

An important part of the ANOVA is considering the group mean vs. the grand mean.

The group mean is (quite simply) the mean for each group being compared (e.g., mean of sophomores, mean of juniors, etc.)

The grand mean is the mean of the entire sample (i.e., sophomores, juniors and seniors together).

Concept: Group Mean vs. Grand Mean

For the entire sample (n=276), mean = 3.89

For our research question, the null hypothesis is that the mean political ideology score for Democrats is equal to the mean political ideology score for Republicans and equal to the mean political ideology score for Independents.


Total Sum of Squares: SST = sum of (individual score - grand mean)^2

Sum of Squares Within: SSW = sum of (individual score - group mean)^2, summed within each group

Sum of Squares Between: SSB = sum of n_k * (group mean - grand mean)^2, summed across the k groups

The total sum of squares is made up of two parts, the within sum of squares and the between sum of squares: SST = SSW + SSB.

Practice Problem

Concern: One of the challenges of the ANOVA is that when you reject the null hypothesis (of equal means), you accept an alternative hypothesis that is vague (that the means are not equal). Typically, you can't just "eyeball" the means to tell which differences are true differences and which result from chance alone. There are no two-tail tests or directional hypotheses.

F table: the critical value is found using df - between and df - within.

Concept: General Linear Model

The general linear model states that the best prediction of the dependent variable for any particular case is equal to the mean score plus the effect of any independent variable.

If I randomly select one person from the population, what is my best prediction of her political ideology score (knowing absolutely nothing about her)?

However, if I then learn that she is a Republican, I can adjust my predicted score to account for the "additive effect" of being a Republican in my data - essentially, the difference between the group mean for Republicans and the grand mean for the whole sample.

This score - the grand mean plus the added effect (+/-) of being a Republican - is now the best prediction of her score, but I also have to acknowledge error. There are other factors (that I don't know) that will cause her score to deviate from this predicted mean.

The general linear model basically decomposes a predicted score into three parts: the mean score for the whole sample, the additive effect from particular categories, and the error term.

Democrats: Mean Ideology Score = 3.23

Independents: Mean Ideology Score = 3.90

Republicans: Mean Ideology Score = 4.70

Overall Sample: Mean Ideology Score = 3.89

Research Question: Do these differences represent true population differences, or is it likely that they are the result of chance alone?

Group Means:

High School - 3.67

Young Adult - 4.42

Middle-aged - 5.58

Retired - 7.92

Grand Mean: 5.40

The best prediction of her score is the group mean!

These results are from a random sample of respondents to gauge their interest in civic affairs. The scores range from no interest (0) to high levels of interest (10). Researchers wanted to know whether the level of interest in civic affairs varies by age.
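The general linear model decomposition can be sketched with the group means from this civic-affairs example (grand mean 5.40):

```python
# General linear model: predicted score = grand mean + additive group effect.
# Group means and grand mean come from the civic-affairs example above.
grand_mean = 5.40
group_means = {"High School": 3.67, "Young Adult": 4.42,
               "Middle-aged": 5.58, "Retired": 7.92}

# The additive effect of a group is its distance from the grand mean ...
effects = {g: m - grand_mean for g, m in group_means.items()}

# ... so grand mean + effect recovers the group mean, which is the best
# prediction for a person whose group we know (plus an unknown error term).
prediction = grand_mean + effects["Retired"]
```

Knowing nothing about a respondent, we would predict the grand mean (5.40); knowing she is retired, the prediction shifts by the additive effect (+2.52) to the group mean (7.92).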

So far, we have done hypothesis testing for continuous variables across discrete categories ...

One category - one-sample hypothesis test (e.g., mean age when people have their first child is equal to 23)

Two categories - two-sample t-test (e.g., mean age when people have their first child differs for men and women)

Three or more categories - ANOVA, F-ratio (e.g., mean age when people have their first child differs across racial groups - Black, White, Asian, other)

Today, we will turn to an analysis of discrete (ordinal/nominal) variables across a range of categories (ordinal/nominal).

Top Hat: Write down a pair of discrete variables that you would expect to be statistically independent of one another. In other words, knowing your category on one of the variables tells us nothing about your likelihood of being in a particular category of the other variable.

Top Hat: What are some pairs of discrete variables that you would not expect to be statistically independent of one another? In other words, name a pair of discrete variables where knowing your category on one variable tells us something about what category you fall into for the second variable.

Cells! (Showing Joint Frequency)

Column Percentages (calculated by dividing the joint frequency by the column marginal; in other words, the number of extremely liberal women divided by the total number of women.)

The null hypothesis is that two variables (in a bi-variate table) are independent of each other. If the null hypothesis were true, then we would expect the cell frequencies to be the result of random chance alone.

Grand Total: 46,160 (total number of observations)

Compare your obtained Chi Square to the Chi Square critical value.

Practice Problem

#1: Researchers want to determine whether homeowners and renters vary in their support for stronger gun control laws. They sample 657 people and observe the following frequencies.

- 308 Homeowners favor stronger gun control
- 119 Homeowners oppose stronger gun control
- 175 Renters favor stronger gun control
- 55 Renters oppose stronger gun control

Setting alpha equal to 0.05, determine whether there is a relationship between homeownership and support for gun control in America.

#2: Researchers want to determine whether support for stronger gun control laws differs according to which candidate individuals supported in the 2008 election. They sample 810 people and observe the following frequencies.

- 379 Obama supporters favor stronger gun control
- 194 McCain supporters favor stronger gun control
- 12 supporters of other candidates favor stronger gun control
- 92 Obama supporters oppose stronger gun control
- 123 McCain supporters oppose stronger gun control
- 10 supporters of other candidates oppose stronger gun control

Setting alpha equal to 0.05, determine whether there is a relationship between which candidate a person voted for and support for gun control in America.

Top Hat: In our political data, what is the expected frequency for conservative men? What is the expected frequency for extremely liberal women? And what is the observed frequency for conservative men? For extremely liberal women?

**Correlation**

We often talk about social phenomena that are correlated. When we discuss correlation, we're considering two continuous measures that co-vary - or that vary together. When the value of one variable systematically changes as the value of the second variable changes, we say that the two variables are correlated.

Height, Shoe Size and the Amount of Money in your Wallet

**Concept:** Scatter Plot

**Concept:** Pearson Correlation Coefficient

**Concept:** Direction

**Concept:** Strength

**Concept:** Correlation vs. Causation

**Concept:** Linear Relationship

**Concept:** Curvilinear Relationship

**Concept:** Coefficient of Determination, r-squared

**Concept:** Residual Variance

**Concept:** Best Fit Line

A scatter plot is a two-dimensional graph that shows the coordinates between two variables - X and Y - for all the observations in a data set. It provides visual evidence to assess whether two variables are correlated.

As the size (measured in carats) of a diamond goes up, the price goes up. We would say that size and price are positively correlated.

As reading scores increase, writing scores increase, as well. We would say that reading scores and writing scores are positively correlated.

Each dot on the scatter plot is a different observation in our data (in this case, each dot is a different student in our data).

Scatter plot of height and shoe size.

Scatter plot of height and money in your wallet.

Two continuous variables - X and Y - can be said to be related in one of two ways:

1. Positive Correlation. When the value of X increases, the value of Y increases.

2. Negative Correlation. When the value of X increases, the value of Y decreases.

Top Hat: An example of two variables that are positively correlated? (Remember: Both variables must be continuous!)

X: Hours studied; Y: Score on an exam

X: Temperature; Y: Number of people going to the beach for the weekend

X: Number of ice cream cones sold each day; Y: Number of bottles of water sold each day

Top Hat: An example of two variables that are negatively correlated? (Remember: Both variables must be continuous.)

X: Number of books read; Y: Hours spent watching TV

X: Hours slept; Y: Time spent socializing

When a change in one variable has no relationship with a change in a second variable, we say that the variables are uncorrelated. When no correlation exists, a change in X is unrelated to a change in Y.

In addition to noting the direction of a correlation, we can talk about how strong the correlation is.

For example, shoe size and height are very strongly correlated. We can have a pretty good guess about what your shoe size is when we know your height.

Other variables have an association, but the correlation is much weaker. For example, we might know that hours slept is weakly correlated with exam scores. There is a relationship between them, but it is not particularly powerful.

For countries around the world, what do you think the relationship is between average life expectancy and mean number of years of schooling?

- Positive correlation?
- Negative correlation?
- Uncorrelated?

How do I know that this is the line that "best fits" the data? There are an infinite number of lines that I could draw through the data. How do I know which one is the "best fit" line?

The "best fit" line.

This line is the mean of years of schooling (7.2) for the sample.

Regression line

Without the best fit line, our best guess of the mean number of years of schooling would simply be the mean of the sample.

However, the best fit line helps us to more accurately guess the mean years of schooling when we know the life expectancy of a country. Because the variables are correlated, knowing something about X tells us something about Y.

**Concept:** Predicting Y

One of the main reasons we look for correlations is that it helps us improve our prediction of Y. Without any other information, our best guess of the value of Y for any variable is the mean of the sample.

Note: Pearson's r always ranges from -1 to 1. The sign indicates whether the variables are positively or negatively correlated. The value (absolute value) indicates the strength of the correlation.

- -1 indicates a perfect negative correlation
- 1 indicates a perfect positive correlation
- 0 indicates that the variables are uncorrelated

Example: I want to predict the shoe size for a random person. If I know nothing about that person, my best guess of that person's shoe size is the sample mean. However, if I know something about that person's height, I can make a more accurate prediction. Height and shoe size are correlated.

Example: I want to know how many years of schooling an individual has completed. Knowing nothing else about that person, my best guess for the number of years of schooling completed is the sample mean. However, if I know something about their mother's level of education, I can make a better prediction of their education because your education level and your mother's education level are correlated.

Example: I want to predict the average number of years of schooling completed in country A. If I know nothing else about country A, my best guess of the average number of years of schooling would be the mean for the sample. However, if I know the GDP in the country, I can make a better guess because average years of schooling and GDP are correlated.

The "best fit" line is the line that minimizes the amount of error between each observation and the regression line. Later, we will talk about minimizing the sum of the squared error. For the moment, suffice it to say that the "best fit" line is the line that best reduces the amount of error between each observation and the line.

For each observation, the difference between the observed value and the predicted value is the error term.

Practice Problem: Five individuals report the number of hours of TV they watch and the number of hours they spend reading. Calculate the correlation coefficient for hours of TV watched and hours spent reading.

Bill: 5 hours of TV, 11 hours reading

Michelle: 7 hours of TV, 8 hours reading

Anne: 8 hours of TV, 5 hours reading

Hillary: 7 hours of TV, 6 hours reading

George: 3 hours of TV, 10 hours reading

Start with a simple scatter plot. What does the scatter plot tell you about the correlation between hours spent watching TV and hours spent reading?

Calculate the correlation coefficient.
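A sketch of the calculation for this practice problem, using the definitional formula for Pearson's r:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))  # covariation
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

tv = [5, 7, 8, 7, 3]         # hours of TV: Bill, Michelle, Anne, Hillary, George
reading = [11, 8, 5, 6, 10]  # hours reading, same order
r = pearson_r(tv, reading)
```

The result, roughly r = -0.83, is a strong negative correlation, matching the downward-sloping pattern the scatter plot should show.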

However, when I know something about a variable that is correlated with our outcome - in this case, Y - then I can make a better prediction.


In our data, let's calculate Pearson's r for the relationship between shoe size and height. (Any guesses?)

Now, let's calculate Pearson's r for the relationship between money and height. (Any guesses?)

Linear regression is the bread & butter of social science research. If you can master linear regression, you have the basic building block for more advanced topics in quantitative social science.

The regression equation: Y = a + bX + e, and the predicted value of Y is Yhat = a + bX.

- Yhat = predicted value of Y
- a = Y-intercept
- b = slope of the regression line (through the scatter plot)
- Y = the observed value of Y
- e = error term (or the difference between the observed value of Y and the predicted value of Y)

The formula for the Y-intercept: a = Ybar - b(Xbar), where Ybar is the mean of Y, b is the slope of the regression line, and Xbar is the mean of X.

The Y-intercept is the place where the regression line crosses the Y-axis. We often refer to this as the constant.

We can also think of the Y-intercept as the value of Y when X=0.

The formula for b, the regression coefficient, is: b = sum of (X - Xbar)(Y - Ybar) / sum of (X - Xbar)^2. The numerator is the covariation of X and Y - a measure of how much X and Y vary together.

The regression coefficient is simply the slope of the line that runs through your scatter plot. (From previous classes, you may be familiar with the idea of a slope as rise over run ... )

Interpreting a regression coefficient:

1. Talking about how one variable co-varies alongside another: We typically say a one-unit change in X is associated with a b-unit change in Y.

2. Talking about the predicted value of Y, based on the regression line.

**Concept:** Plotting the Regression Line

An example using real data ...

After calculating the Y-intercept (a) and the regression coefficient (b), we have enough information to overlay our regression line atop a scatter plot.

Did you notice a similarity in the formulas for the regression coefficient (b) and the correlation coefficient (r)? The correlation coefficient is basically the same formula applied to standardized scores, rather than the raw scores. If we plotted the standardized scores on a scatter plot, our best fit line would have a slope equal to the correlation coefficient.

Mother's Education (Mean): 11.662

Respondent's Education (Mean): 13.907

Regression coefficient, b: 0.437
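With the slide's numbers, the intercept follows from a = Ybar - b(Xbar); the 12-years input below is an illustrative value, not from the slides:

```python
# Numbers from the mother's-education example above
xbar = 11.662   # mother's education, mean (X)
ybar = 13.907   # respondent's education, mean (Y)
b = 0.437       # regression coefficient (slope)

# Y-intercept: a = Ybar - b * Xbar
a = ybar - b * xbar

# Predicted respondent's education for a mother with 12 years of schooling
# (the 12 is a hypothetical input chosen for illustration)
y_hat = a + b * 12
```

So the fitted line is roughly Yhat = 8.81 + 0.437X: each additional year of mother's education is associated with about 0.44 more years of respondent's education.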

Each point is an observation from our dataset. The observed value for that point is Y. The predicted value for that point is Yhat. The difference between the observed value and the predicted value is the error, e.

**Concept:** Proportional Reduction in Error (PRE)

Knowing nothing about a particular observation, we know that our best guess of their level of education would have been the sample mean. When we know something about that person's mother's level of education, we can use our regression line to make a better prediction about our respondent's level of education. We can make a better prediction of Y (education) knowing something about X (mother's education), but how much better will our prediction be?

To calculate the PRE, we first make a prediction assuming we know nothing about the independent variable, X.

Then, we make another prediction using the information we know about the independent variable, X.

Quite simply, the PRE tells us the proportional reduction in errors when we know X vs. when we don't know X. How much better did we do in predicting the outcome when we know X than when we didn't know X?

In this case, how much better did we do in predicting respondent's education when we know mother's education, rather than when we only know the sample mean?
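Assuming errors are measured as sums of squared deviations (as in OLS), the PRE logic above can be sketched as: E1 = errors when we predict everyone at the mean, E2 = errors when we use the regression line, PRE = (E1 − E2) / E1. (Illustrative data, my own sketch.)

```python
def pre(xs, ys):
    """Proportional reduction in error: how much better we predict Y knowing X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    e1 = sum((y - y_bar) ** 2 for y in ys)                    # errors guessing the mean
    e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # errors using the line
    return (e1 - e2) / e1

p = pre([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])  # made-up data
```

A PRE of, say, 0.73 would mean knowing X reduces our prediction errors by about 73 percent compared with always guessing the mean.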

**Concept:**

Outliers

What is an outlier?

How does it affect the

mean of a distribution?

In regression analyses, outliers can "pull"

the regression line up, leading the regression

line to "misfit" the data.

One of the advantages of the scatter plot

is that you can visually see the outliers in

your data. There are tools - called regression

diagnostics - that we can use for evaluating

the presence and the impact of outliers

in our regression analysis.

For starters, we will use a linear regression to talk about how a change in the level of one continuous variable is associated with a change in the level of another continuous variable. In doing so, we can predict the level of our dependent variable (Y) with information from our independent variable (X).

Today, we will start making a distinction

between our independent and dependent variables.

Our dependent variable - the one we

want to predict - is Y. Our independent variable -

the one we're using to make that prediction - is X.

Examples:

1. If we want to know how mother's level of education is associated with the level of respondent's education, then mother's education is X and respondent's education is Y.

2. If we want to know whether the number of violent crimes in a neighborhood is associated with rates of passing exams in neighborhood schools, we are using number of crimes (X) to predict passing rates on exams (Y).

3. If we want to know whether the on-time arrival percentage of airlines predicts the number of customer complaints airlines receive, then the percentage of on-time arrivals is our independent variable (X) and the number of customer complaints is our dependent variable (Y).

Top Hat:

If I want to predict a person's income,

give me an example of a continuous independent

variable (X) that might be related to a person's

income (Y).

What is the unit of analysis in this problem? (Not on

Top Hat)

If I picked a random person, and knew nothing else

about that person, what would be my best guess

of that person's income? (Not on Top Hat.)

Let's start with an easy example ...

X = Percentage of flights that arrive on-time

Y = Number of complaints received

Airline 1: X = 80%, Y=200

Airline 2: X = 40%, Y=210

Airline 3: X = 90%, Y=140

Airline 4: X = 60%, Y=230

Airline 5: X = 70%, Y=130

**Concept: Testing**

the Significance of

the Pearson's Correlation

Coefficient, r

As with other statistics obtained from a sample, we

may want to test whether our correlation coefficient, r,

is statistically significant. In other words, does the

linear relationship between X and Y truly exist in

the population, or is it the result of sampling error?

Null Hypothesis: H0: ρ = 0 (no linear relationship between X and Y in the population)

Test statistic (to be used with critical values

in the t-table, with df = n − 2):

t = r √(n − 2) / √(1 − r²)

Calculate the Y-intercept (a)

and the slope of the regression

line (b)

Question: If you knew nothing about the

percentage of on-time arrivals for a particular

airline, what would be your best guess of the

number of customer complaints it received?

Question: If you knew that an airline had an

on-time arrival rate of 75%, what would be

your best guess of the number of customer

complaints it received? (Calculate Y-intercept,

calculate regression coefficient.)
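The arithmetic for the five airlines above can be sketched as follows (the rounded results are my own calculation - check them against your own work, and don't assume they match other examples in the deck):

```python
# On-time percentage (X) and number of complaints (Y) for the five airlines
xs = [80, 40, 90, 60, 70]
ys = [200, 210, 140, 230, 130]

n = len(xs)
x_bar = sum(xs) / n              # 68.0
y_bar = sum(ys) / n              # 182.0 -- best guess knowing nothing about X

# Slope: covariation of X and Y over the variation of X
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)

# Y-intercept: forces the line through (x_bar, y_bar)
a = y_bar - b * x_bar

# Predicted complaints for an airline with a 75% on-time rate
y_hat_75 = a + b * 75
```

With only five observations these estimates are purely illustrative, but the logic (compute b, then a, then plug in X) is exactly the hand calculation the questions ask for.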

The regression line is also the line that minimizes the sum of the squared error terms.

If you took each error term - the difference

between the observed value of Y and the

predicted value of Y - and squared them (to

make them all positive) and added them up,

there is no other line you could draw that

would make that value smaller.

In statistics, we often call this Ordinary Least Squares (or OLS) regression.
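One quick way to see the "least squares" property numerically: compute the sum of squared errors for the OLS line, then for some nearby lines - no other line does better. (A sketch with made-up data.)

```python
def sse(a, b, xs, ys):
    """Sum of squared errors for the line Y-hat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Nudging the OLS intercept or slope never shrinks the sum of squared errors
ols = sse(a, b, xs, ys)
assert all(sse(a + da, b + db, xs, ys) >= ols
           for da in (-2, 0, 2) for db in (-0.5, 0, 0.5))
```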

Slope (b) = 1.466

Y-Intercept (a) = 10.90

Slope (b) = 0.24

Y-Intercept (a) = 78.11

When we calculate a regression line, it's worth asking,

"How much better are we at predicting Y when we know

X than when we don't know X?"

In other words, do we get a better prediction of respondent's education (Y) when we know something about mother's education (X) than when we don't?

How much better are we at predicting a country's life

expectancy (Y) when we know the mean education level (X) rather than when we know nothing about education?

The coefficient of determination tells us how much of the variation in Y can be "explained" by its relationship to X.

It is, quite simply, the square of Pearson's correlation coefficient, r. We typically refer to the coefficient of

determination as "r-squared".
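This claim is easy to check numerically: the ratio of explained variation to total variation comes out equal to r². (A sketch with made-up data.)

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
var_x = sum((x - x_bar) ** 2 for x in xs)
var_y = sum((y - y_bar) ** 2 for y in ys)

r = cov / (var_x * var_y) ** 0.5   # Pearson's correlation coefficient
b = cov / var_x
a = y_bar - b * x_bar

explained = sum(((a + b * x) - y_bar) ** 2 for x in xs)  # explained by the line
total = var_y                                            # total variation in Y
r_squared = explained / total                            # equals r ** 2
```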

Explained variation in

Y. (Explained by the

regression line.)

Total variation in Y.

The observed values do not fall close to

the regression line. The regression line

is not doing a great job explaining the

variation in Y. There is still quite a bit

of error.

The observed values fall quite close to the regression line. The line does a good job of explaining the variation in Y. There is relatively little error in this figure.

Low R-Squared

High R-Squared

**Multiple**

Regression

**Concept: Interpreting the**

Regression Coefficients

Y-Intercept (a)

Slope (b)

(also known as your

regression coefficient)

Coefficient of determination

test statistics

and p-values

(test statistics

test whether the

coefficient is different

from zero)

Dependent variable

Independent

Variable(s)

How do we interpret the regression coefficients?

What do they mean?

First, we rarely interpret the Y-Intercept (a).

It's not a very meaningful statistic.

However, the regression coefficient is extremely important. To interpret the regression coefficient,

we usually say: "A one-unit change in X is associated with a b-unit change in Y," where b is the regression coefficient.

"A one-unit increase in mother's education is associated with a negative 0.083-unit change in the number of hours of television watched."

Or, less jargony ...

"When the mother's level of education goes up by one year, we would expect the number of hours of television watched to go down by about 0.083 hours."

**Concept:**

Predicting Y

Predicting Y

Sometimes, we'll call the regression equation

the "prediction equation" because it allows us

to make a prediction of Y, given our knowledge

of X.

Take this prediction equation:

Top Hat:

What is the predicted level of education

for someone whose father has fifteen years

of education?

Top Hat:

What is the predicted level of education

for someone whose father has eight years

of education?

Top Hat:

What is the predicted level of education

for someone whose father has zero years

of education?

When we talk about multiple regression - or multivariate regression - we are simply talking about adding more predictor variables to the equation. Instead of a single

independent variable, we will now have multiple independent variables (multiple Xs).

Why would we want multiple predictor variables?

1. Knowing more information about our observations will help us make better prediction decisions.

When X and Y are correlated, knowing something about X enables us to make better predictions about Y. When there are two independent variables that are correlated with Y, it often helps us make even better predictions. And so on and so on ...

2. The world is messy! All dependent variables are influenced by many things.

For example, if I want to predict your income (Y), there are lots of important independent variables to consider - your level of education, your gender, the prestige of your university, your parents' income, the type of industry where you work, etc.

I could run bivariate regressions for each of these, but as you'll see, it's better to put them into a single prediction equation.

3. We want to know if the relationship we observe in the bivariate framework could, in fact, be explained by an additional variable.

In other words, we want to know whether there is a

direct relationship

- X---->Y - or whether the relationship is

spurious

- namely, a third variable (Z) is causing the change in both X and Y.

Concept: Partial

Correlation Coefficient

Concept: Partial

Slope Coefficient

The bivariate correlations we've calculated are known as zero-order correlations. The correlation between X and Y is a zero-order correlation.

Partial correlation coefficients take into account the influence of a third variable - Z - when figuring out the correlation.

What is the zero-order correlation between the amount of housework done by the husband and the number of children in the household?

Correlation of Y and X = 0.50

Now, we want to know if the correlation between housework and children is affected by the husband's

years of education.

Partial correlation coefficient:

Correlation of X and Y, controlling for Z

Correlation of Y and X = 0.50

Correlation of Y and Z = - 0.30

Correlation of X and Z = - 0.47
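Using the standard formula for a first-order partial correlation (the formula itself did not survive on the slide), we can plug in the three zero-order correlations above:

```python
def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of X and Y, controlling for Z (standard first-order formula)."""
    return (r_xy - r_xz * r_yz) / (((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5)

# Zero-order correlations from the housework example above
r_xy_z = partial_corr(r_xy=0.50, r_xz=-0.47, r_yz=-0.30)
```

By my arithmetic, controlling for the husband's education shrinks the housework/children correlation from 0.50 to roughly 0.43 - still a moderate association.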

Concept: Multiple

Regression Equation

Multiple predictors of

our dependent variable

For example, we might want to know how your level of education (X1), your parents' income (X2) and your college GPA (X3) influence your post-college income (Y).

Bivariate Linear Regression Equation

Multivariate Linear Regression Equation

[Diagram: temperature/heat (Z) drives both the # of ice cream cones sold (X) and the # of fires in the city (Y), producing a spurious association between ice cream sales and fires.]

A few ways to think about partial correlation:

What is the correlation for X and Y, controlling for Z?

What is the association between X and Y when we factor out the influence of Z?

What is the correlation between X and Y independent of the influence of Z?

These are three ways of saying the same thing!

Back to our ice cream cone example, the zero-order correlation between ice cream cones sold and the number of fires would be positive (and strong).

However, controlling for the temperature, we would expect the partial correlation of these two variables to be zero. Factoring out the influence of the temperature, there should be no association between ice cream cones sold and the number of fires in a city.

Top Hat:

Name three independent variables (X) that might be associated with the amount of money an individual has in his or her retirement account (Y)?

Slope coefficient

in the bivariate case

Slope coefficient in the multivariate case

(when there are 2 independent variables)

Y-Intercept in the

multivariate case

(when there are 2

independent variables)

Correlation of Y and X = 0.50

Correlation of Y and Z = - 0.30

Correlation of X and Z = - 0.47

Standard deviation of Y?

Standard deviation of X?

(Good practice for the final!)
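One common textbook formula for the partial slope converts the partial relationship back into Y's raw units using the two standard deviations. A sketch using the housework correlations above - the standard deviations here are hypothetical placeholders, since the slide asks you to work those out yourself:

```python
def partial_slope(r_y1, r_y2, r_12, s_y, s_1):
    """Partial slope of Y on X1, controlling for X2 (standard two-predictor formula)."""
    return ((r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)) * (s_y / s_1)

# Correlations from the housework example; s_y and s_1 are made up for illustration
b1 = partial_slope(r_y1=0.50, r_y2=-0.30, r_12=-0.47, s_y=2.0, s_1=4.0)
```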

Concept: Interpreting

Multiple Regression Analysis

In the bivariate framework, we talked about the

impact of a one-unit change in X on the predicted

value of Y.

In the multivariate framework, we are beginning to look at the simultaneous impact of multiple predictors on the predicted value of Y.

Top Hat:

Using counties as the unit of analysis,

we want to study the relationship between education (measured by the % of people with a high school degree) and the level of crime (measured by the # of crimes per 10,000 people). Would you expect the number of crimes committed in a county (Y) to go up or down as the percentage of people holding a high school degree (X) increases?

Top Hat:

In one sentence, interpret this prediction equation.

Top Hat:

What is the predicted number of crimes

per 10,000 people in a county where half

of the population holds a high school degree?

Why is that? Why does the regression equation show an increase in the number of crimes as education increases when we would expect the opposite?

The relationship is spurious. (Remember: The ice cream cone/bottle of water example!) There is a third variable correlated with both of these ... but what is it?

[Diagram: Urbanization (% of people living in an urban area) is positively related to both Education (% of people holding a high school degree) and Crime (# of crimes per capita) - urbanization is the third variable.]

Partial Regression Coefficient:

Standard Deviation, Crime = 28.193

Standard Deviation, Education = 8.859

Standard Deviation, Urbanization = 33.969

Interpretation?

Predicted number of crimes per 10,000 people

when 50% of people live in cities and 40% of

people hold a high school degree.

Notice: When we control for the urbanization rate in our

regression analysis, the sign on education flips from positive to negative. Controlling for the level of urbanization in a county, the percentage of high school graduates is negatively related to the number of crimes committed!

How do we interpret the multiple regression coefficients?

Controlling for the level of urbanization in a county, a one percentage-point increase in the number of high school graduates is associated with a decline of 0.58 crimes committed per 10,000 people.

Second example: Using state-level data,

we want to predict the relationship between

the percentage of poor people in a state and the

violent crime rate.

Expectations?

An outlier!

What would happen

to the regression line, the

correlation coefficient (r)

and the R-squared value

if I took that data point

out of the analysis?

Why?

Correlation, Violent Crime & Poor = 0.509

Correlation, Violent Crime & Single-Parent = 0.839

Correlation, Single-Parent & Poor = 0.0549

Standard Deviation, Violent Crime = 441.103

Standard Deviation, Poor = 4.584

Standard Deviation, Single-Parent = 2.121

b = 6.787

Interpretation?

Instead of adding in the percent single-parent, what would have happened if I added a variable into the model that was highly correlated with the violent crime rate (Y) but uncorrelated with the first predictor variable (percent poor)?

The answer - there is no impact on X1 if you include an X2 that is uncorrelated with it!

Concept: Inserting Dummy

Variables into a Regression

Although we introduced regression analysis in the context of continuous measures, we can also put dummy variables - or dichotomous variables - into the model as predictors.

The interpretation of coefficients is analogous to the interpretation with continuous variables. A one-unit change in X - in this case, from "0" to "1" - is associated with a b-unit change in Y.

Dichotomous variables:

Black/white

Male/female

Over 25/under 25

In college/not in college
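With a 0/1 dummy as X, the bivariate slope is simply the difference between the two group means (and the intercept is the mean of the "0" group). A sketch with made-up data (the renter/homeowner labels are hypothetical):

```python
# X: 0 = renter, 1 = homeowner (hypothetical dummy); Y: some continuous outcome
xs = [0, 0, 0, 1, 1, 1]
ys = [2, 4, 6, 5, 7, 9]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

mean_0 = sum(y for x, y in zip(xs, ys) if x == 0) / 3   # mean of the "0" group
mean_1 = sum(y for x, y in zip(xs, ys) if x == 1) / 3   # mean of the "1" group
# b equals the jump from the "0" group mean to the "1" group mean
```

So "a one-unit change in X" here literally means moving from one group to the other.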

Concept: Standardized

Regression Coefficients

When we interpret regression coefficients, we interpret them in their original unit of measurement.

e.g., Education measured in years of education

e.g., Income measured in dollars

e.g., GPA measured in GPA points

e.g., Homeownership rate measured in percent homeowners

Why might we want standardized coefficients?

(And what are standardized coefficients?)

Think back to Z-scores ...

We said that every score has a raw score and a standardized score - or a z-score.

e.g., I got an 85 on the exam; my Z-score was 0.57.

e.g., My GPA is 3.20; my Z-score is -1.25

Z-scores allow us to compare across distributions. In this case, Z-scores allow us to compare across regression coefficients to determine which independent variable is a stronger predictor of Y.

Standardized Regression Coefficient

Standard Deviation, Crime = 28.193

Standard Deviation, Income = 4.682

Standard Deviation, Education = 8.858

Standard Deviation, Urbanization = 33.969

Interpretation: A one standard deviation change in X is associated with a Beta standard deviation change in Y.

Advantage: Compare across independent variables whose underlying units of measurements are different.

Disadvantage: Lacks the intuitive interpretation of unstandardized regression coefficients.
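Converting an unstandardized b into a Beta just means multiplying by the ratio of standard deviations. As a sketch, using the education coefficient from the earlier crime regression (b ≈ −0.58) with the standard deviations listed above - the exact value is my own arithmetic, not the slide's:

```python
def standardized_beta(b, s_x, s_y):
    """Beta = b * (SD of X / SD of Y): the slope in standard-deviation units."""
    return b * s_x / s_y

# b for education from the earlier crime example; SDs from the slide
beta_education = standardized_beta(b=-0.58, s_x=8.858, s_y=28.193)
# Roughly: a one-SD increase in the % of high school graduates is associated
# with about a 0.18-SD decrease in crime, holding urbanization constant.
```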


**Causation**

Inferential statistics:

Are the differences observed in a sample

the result of true population differences,

or are they the result of chance alone

(sampling error)?

Association:

We've asked how a change in one variable (X)

is associated with a change in another variable (Y).

e.g., Regression: about correlation; a one-unit change in X is associated with a b-unit change in Y.

e.g., Chi-Square: a test of statistical independence,

whether two discrete variables are related.

An oft-repeated mantra in the social sciences is that correlation does not imply causation.

In other words, just because two variables

are correlated (there is an association) doesn't

mean that a change in one variable causes

a change in another variable.

Think: Ice cream sales and bottles of water sold

in the city. We know that they're associated - as

the number of ice cream cones sold rises, so too

does the number of bottles of water sold. However,

increasing the number of ice cream cones sold

does not

cause

the number of bottles of water sold to

increase!

Correlation does not equal causation.

**Three criteria for causality:**

**1. An association.**

**2. Time ordering.**

X must come before Y.

Example: Boys that are in the Boy Scouts have

lower rates of juvenile delinquency (e.g., arrests)

than those that are not in the Boy Scouts.

Am I less likely to be delinquent because I joined

the Boy Scouts, or did I join the Boy Scouts because

I'm unlikely to be a delinquent?

Here, the temporal ordering - which came first - is

ambiguous.

Example: Students that participate in programs

at the Center for Social Justice (CSJ) are more

likely to advocate for social issues.

Do I participate in CSJ programs because I'm an advocate

for social issues, or do I advocate for social issues because

I participate in CSJ programs?

Again, the temporal ordering is ambiguous.

**3. Elimination of**

Alternative Explanations

What work have you done to eliminate other explanations of the association?

When we're flying and the fasten seat belt

sign goes on just before it becomes turbulent,

we've satisfied the association requirement

(when the light's on, there's more likely to be

turbulence) and the temporal ordering (the

light came on before the turbulence happened),

but we know that one did not cause the other!

We have a spurious association, or a third variable problem!

If the weather looks turbulent (Z), the pilot is likely to

turn on the fasten seat belt sign (X) and there is likely

to be turbulence (Y). Therefore, the causal factor is

weather!

When we can measure the third variable,

we can control for it in our multiple regression

model (e.g., Urbanization was the third variable

(Z) that explained the relationship between

education and crime).

Example: I administer the same math test

to all of the kids in a middle school, and I find

a strong association between a student's height (X)

and his/her score on the math exam (Y). This

satisfies both the association and the temporal

ordering criteria ... but have I ruled out all the

alternative explanations? What might my third

variable be?

Age!

In this case, older students tend to be taller, as they've

had more years to grow.

Older students tend to do better at math, as they've had

more years to study.

Sometimes, the spurious (or third) variable is very

difficult to measure.

GPA --------> Lifetime Income

Does a high GPA really cause higher incomes?

Or is there a third variable that's difficult to measure that causes both GPA and income to rise?

We often call these variables

unobservables.

What are some variables that are difficult to measure?

What variables might we have a hard time observing

in the population?

Besides spurious associations, we sometimes encounter intervening variables.

Here, we find that A causes B, and

B causes C, but we might only observe A and C.

A --------------------------- > C

Education -----------------------> Longer Life Span

Why does more education make people live longer?

A -----------> B -----------> C

Education -----> Increased Income --------> Longer Life Span

What's the reason?

**Summary:**

Spurious/Third Variable

Intervening Variable

**Age**

**Height**

**Math Scores**

**Education**

**Income**

**Life Expectancy**

**Experimental Studies:**

The Gold Standard for Causal Inference

Why are experiments so important for

making causal claims?

**Experimental Group**

vs.

Control Group

With randomization, you can basically control for unmeasured differences between the two groups! As a result, researchers can attribute any difference between the groups to the experimental treatment.

**Observational Studies:**

More typical of social science research

Example: I want to know whether homeowners

make for better citizens than renters. Do they vote

more often, volunteer in their communities, and

join community groups?

What are some differences (on average) between

homeowners and renters that we can observe?

What are some differences between homeowners

and renters that would be more difficult to observe?

**Income**

**Homeownership**

**Voting**

**Community-Minded**

**Homeownership**

**Voting**

**Observed:**

**Unobserved:**

**Final Question: Why do we care**

so much about causality?

[Diagram: Weather looks turbulent (Z) ----> Pilot turns seat belt sign on (X); Weather looks turbulent (Z) ----> Passengers experience turbulence (Y).]

**How hard you work?**

How much grit you have?

Your ability to focus?

**Reading "Causal" Headlines**

1. What’s the difference between a one-sample t-test and a two-sample t-test? Give an example of a research question in which researchers would choose each kind of test.

2. Explain (in words) why researchers would want to run a two-sample t-test.

3. Can you ever know if you’ve committed a Type I error or a Type II error? Why or why not?

4. Explain the concept of a p-value. When you run a one-sample t-test, what is the p-value telling you?

5. Explain the relationship between your critical value and your p-value.

6. What is the difference between the group mean and the grand mean in an ANOVA?

7. Researchers are going to test for differences in the average weight of preschool children in four different neighborhoods in Washington DC. State the null hypothesis and the type of test you would use.

8. What are the two things that Pearson’s r tells you about the relationship between two continuous variables?

9. How do you calculate the expected frequencies required for a Chi-Square test?

10. What does it mean for two variables to be statistically independent of each other when you're running a Chi-Square test?

11. What do researchers mean when they say that there is a linear relationship between X and Y?

12. Explain what "least squares" refers to in an Ordinary Least Squares regression.

13. Give an example of a spurious relationship that does not involve ice cream cones.

14. Explain why researchers never accept a null hypothesis (and instead, either reject or fail to reject).

Let's make a cross-tab for gender & year in this class.

In this class, there are ...

X male sophomores

X male juniors

X male seniors

X female sophomores

X female juniors

X female seniors

The General Social Survey gave

1,395 people the following statement and asked

them, "Do you strongly agree, agree, disagree, or

strongly disagree with the statement."

The statement is: " In the United States traditional divisions between owners and workers still remain. A person's social standing depends upon whether he/she belongs to the upper or lower class." Researchers want

to know whether agreement with this statement varies

between men and women.

Of the men in their sample, they found that 88 strongly agreed, 319 agreed, 162 disagreed, and 30 strongly disagreed. Of the women in their sample, 142 strongly agreed, 416 agreed, 189 disagreed, and 54 strongly disagreed.

They hire your group as research assistants for the project. Based on this information, what would you conclude? Are there differences between men and women on their agreement with the statement about class divisions?
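As a sketch of the chi-square arithmetic for this table (expected frequency = row total × column total / grand total; the numbers are my own computation, so verify them against your group's work):

```python
# Rows: men, women. Columns: strongly agree, agree, disagree, strongly disagree.
observed = [[88, 319, 162, 30],
            [142, 416, 189, 54]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, f_o in enumerate(row):
        f_e = row_totals[i] * col_totals[j] / grand   # expected frequency for this cell
        chi_square += (f_o - f_e) ** 2 / f_e

df = (len(observed) - 1) * (len(observed[0]) - 1)     # (rows - 1) * (cols - 1) = 3
# Compare chi_square to the critical value for df = 3 (7.815 at alpha = 0.05)
```

By my arithmetic the statistic comes out below the 0.05 critical value for df = 3, so we would fail to reject the null of statistical independence.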

Concept: Statistical Significance

Under the assumption of no effect (in other words, if the null hypothesis is true), the p-value tells you the probability of obtaining a test result that is equal to - or more extreme - than the actual value that you observed.

In hypothesis testing, alpha simply identifies a proportion of the curve (e.g., 0.05) at which we would reject the null hypothesis. In other words, it would be unlikely to find a test statistic in this region if the null hypothesis were true.

For each value of alpha, we can find a Z-score. Think of this as the cut-off point. This is the critical value. For any test statistic that is beyond this critical value, we will claim that our findings are statistically significant.

When reading social research, you will often see stars (*) indicating the level of confidence in the estimates. Typically, * p < 0.10, ** p < 0.05, and *** p < 0.01. Those stars are associated with the level of confidence with which researchers can make their claims (aka, their statistical significance).

In this case, * means that we can reject the null hypothesis at a level of confidence of 90 percent. ** means that we can reject the null hypothesis at a level of confidence of 95 percent. *** means that we can reject the null hypothesis at a level of confidence of 99 percent.

These stars for significance give us a short-hand way of identifying how confident we can be in rejecting the null hypothesis.

Researchers are interested in the topic of abortion as a political issue. They are thinking about differences between Americans who identify as lower- or working-class, and those who identify as middle- or upper-class.

The General Social Survey asked Americans how concerned they are with abortion. On a scale of 1-4 (which we will treat as a continuous variable, with "1" being not concerned at all and "4" being very concerned), they found that the average score for respondents who identified as lower- or working-class (n=292) was 2.016 and the average score for respondents who identified as middle- or upper-class (n=217) was 1.959. Researchers also report that the standard error for the difference in means is 0.03.

Test whether or not lower- and working-class Americans are more concerned with abortion as a political issue than middle- or upper-class Americans.
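A sketch of the test using the numbers above. The research question is directional (are lower/working-class Americans *more* concerned?), so this uses a one-tailed test; the decision is my own arithmetic:

```python
mean_lower_working = 2.016   # n = 292
mean_middle_upper = 1.959    # n = 217
se_diff = 0.03               # standard error of the difference, from the problem

t = (mean_lower_working - mean_middle_upper) / se_diff   # = 1.9

# One-tailed test (H1: lower/working-class mean is higher), alpha = 0.05
critical_value = 1.645       # the slide's table rounds this to 1.64
reject_null = t > critical_value
```

Since 1.9 exceeds the one-tailed critical value, we reject the null and conclude lower- and working-class Americans are more concerned with abortion as a political issue. (With a two-tailed test at alpha = 0.05, 1.9 < 1.96, so we would fail to reject.)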

In making rejection decisions about the null hypothesis, we have settled on two ways of making those decisions: comparing the p-value to alpha OR comparing the critical value to your test statistic.

We said that alpha is the level of error you're willing to accept, or the likelihood of making a Type I error.

When drawing random samples, it is possible - but unlikely - that we are going to draw a sample that falls far from the population mean. (Remember, 95% of random samples will fall +/- 2 standard deviations from the mean. [Technically, they will fall +/- 1.96 standard deviations from the mean, but we say 2 standard deviations because it's easier to remember!])

A Type I error occurs when we reject the null hypothesis, even though the null hypothesis is actually true.

When we conduct non-parametric tests by hand (e.g., Chi-Square), we will compare the test statistic (e.g., Chi Square) to the critical value to make a rejection decision. The Stata output also includes the p-value, which we can compare to alpha to make our decision.

In a two-tailed test ...

When alpha = 0.05, our critical value = 1.96

When alpha = 0.01, our critical value = 2.58

In a one-tailed test ...

When alpha = 0.05, our critical value = 1.64

When alpha = 0.01, our critical value = 2.33

Imagine in a t-test testing for the difference in means between two samples and using a two-tailed test, you get a test statistic (t) equal to 2.15.

Can you reject the null hypothesis with 95% confidence?

Can you reject the null hypothesis with 99% confidence?

Imagine in a chi-square test with four degrees of freedom, you get a test statistic (chi-square) equal to 10.44.

Can you reject the null hypothesis (that the variables are statistically independent) with 95% confidence?

Can you reject the null hypothesis (that the variables are statistically independent) with 99% confidence?
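Both exercises can be checked mechanically by comparing each test statistic to its critical value (the chi-square critical values below are from a standard table):

```python
def reject(test_statistic, critical_value):
    """Rejection decision: reject the null when the statistic exceeds the CV."""
    return test_statistic > critical_value

# Two-tailed t-test, t = 2.15
reject_95_t = reject(2.15, 1.96)       # reject with 95% confidence
reject_99_t = reject(2.15, 2.58)       # cannot reject with 99% confidence

# Chi-square test, df = 4, chi-square = 10.44
reject_95_chi = reject(10.44, 9.488)   # reject independence with 95% confidence
reject_99_chi = reject(10.44, 13.277)  # cannot reject with 99% confidence
```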

What is the relationship between the correlation coefficient (r) and the slope of the regression line (b)?

It seems that they both tell us something about the relationship between X and Y - two continuous measures.

When we made a scatter plot, we plotted the raw scores - your height in inches, years of education in years, weight of a diamond in carats, etc. When we use the raw scores in a scatter plot, the slope of the regression line (or best fit line) is b.

However, remember that every score has a raw score AND a standardized score (Z). The standardized score tells us something about the score, relative to other scores in the distribution.

If we plotted the standardized scores on the scatter plot, rather than the raw scores, the slope of the regression line (or best fit line) would be r - the correlation coefficient.

As a result, we can think of the coefficient, b, as the unstandardized slope, and the correlation coefficient, r, as the standardized slope.
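This equivalence is easy to verify numerically: convert both variables to Z-scores, refit the line, and the slope through the standardized scores comes out equal to r. (A sketch with made-up data.)

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5

def slope(us, vs):
    mu, mv = mean(us), mean(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / \
           sum((u - mu) ** 2 for u in us)

r = slope(xs, ys) * sd(xs) / sd(ys)   # r = b * (SD of X / SD of Y)

# Standardize both variables, then refit: the slope through the Z-scores is r
zx = [(x - mean(xs)) / sd(xs) for x in xs]
zy = [(y - mean(ys)) / sd(ys) for y in ys]
```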

Running a Regression

in Minitab

**Reading Social Research**

**1. What is the research question? What is the hypothesis that McAdam and Brandt are testing?**

**2. What is the population McAdam and Brandt are studying? How do they collect their sample? What concerns do they express about a representative sample?**

**3. Who are the three groups that McAdam & Brandt collect data on? Why might McAdam & Brandt expect different outcomes for each of these three groups? In other words, what might explain differences between these groups?**

**9. From Table 1, explain the findings for civic attitudes across the three groups in the research. What test did the authors use to test for differences in civic attitudes in Table 1? What was their null hypothesis? Explain what the p-values indicate in this table.**

**In our projects, we have selected a level of alpha (e.g., alpha = 0.05 or alpha = 0.01) and made a rejection decision based on that level of alpha.**

(Recall: Select alpha. Determine a corresponding critical value. Calculate a test statistic. Compare critical value and test statistic (or alpha and a p-value). Make a rejection decision.)

**In most social research, we see stars indicating the level of alpha at which we can reject the null hypothesis. It is simply a short-hand way of making rejection decisions.**

Remember: If p < alpha, we reject the null hypothesis.

+ p < 0.1

* p < 0.05

** p < 0.01

*** p < 0.001

The stars simply indicate whether the p-value is less than alpha for each test. For example, if your p-value was 0.04, we would indicate that with * (because 0.01 < 0.04 < 0.05). With 95% confidence, we can reject the null hypothesis. If your p-value was 0.007, we would indicate that with ** (because 0.001 < 0.007 < 0.01). At 99% confidence, we can reject the null hypothesis.

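The star convention can be written as a small helper function. A sketch (the function name is ours, not a standard library call):

```python
def stars(p):
    """Map a p-value to the conventional significance markers."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "+"
    return ""          # not significant at any conventional level

print(stars(0.04))     # * (reject at alpha = 0.05)
print(stars(0.007))    # ** (reject at alpha = 0.01)
```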

**11. In Table 2, why do the authors use a chi-square test in the column evaluating "proportion doing service"? For one of the rows, explain the results of the chi-square test.**

**13. How would you graph the findings in Table 4? Why did you choose that kind of graph? What would your graph show us?**

**14. Explain the findings in Table 8. What type of test did the authors use? What was their null hypothesis? How would you explain their findings to a lay (non-statistical) audience?**

**15. What is a logistic regression?**

**When we talked about Ordinary Least Squares (OLS) regression,** our outcome measures (or dependent variables) were continuous:

- hours spent studying

- lifetime income

- score on an exam

In the multivariate framework, we used multiple predictor variables (independent variables) to evaluate changes in the dependent variable.

Our mantra (which, by now, you should have memorized) is that a one-unit change in X is associated with a b-unit change in Y.

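The mantra can be made concrete with the textbook least-squares formulas b = cov(X, Y) / var(X) and a = ȳ − b·x̄. The data below are made up purely for illustration:

```python
# Hedged sketch (illustrative data, not from the lecture): estimating
# a and b in Y = a + bX by ordinary least squares.
x = [1, 2, 3, 4, 5]            # e.g., hours spent studying
y = [55, 60, 70, 72, 83]       # e.g., score on an exam

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)    # slope: cov(X, Y) / var(X)
a = ybar - b * xbar                      # intercept

# "A one-unit change in X is associated with a b-unit change in Y."
print(round(a, 2), round(b, 2))          # 47.6 6.8
```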

**But what if our outcome measure is dichotomous and we want to use multiple predictors?**

- Voted (yes/no)

- Passed the course (yes/no)

- Accepted to Georgetown (yes/no)

Earlier, we used a chi-square test to determine whether two nominal variables were associated with each other, but we don't (yet) have a framework to understand the joint impact of multiple variables on a dichotomous outcome.


**Logistic regression tells us how a change in an independent variable affects the odds of a dichotomous event occurring.**

- A 1-unit increase in your high school GPA increases the odds of being accepted to Georgetown.

- Each additional hour spent studying is associated with an increase in the odds of passing the course.

- Holding a college degree is associated with an increase in the odds of voting in an election.

Remember: The odds are simply the probability that an event will occur divided by the probability that an event will not occur.

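The odds definition in a quick sketch (the probability value here is hypothetical):

```python
import math

p = 0.75                   # hypothetical probability the event occurs
odds = p / (1 - p)         # odds = P(event) / P(no event)
log_odds = math.log(odds)  # the log-odds scale that logistic regression models

print(odds)                # 3.0: the event is 3 times as likely to occur as not
```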

**12. Table 3 reports t-tests for the difference in means between graduates and drop-outs (rows 1 + 2) and graduates and non-matriculants (rows 3 + 4). [Note: This table should probably be broken into two separate tables to clearly indicate the comparisons.] Explain why the authors used a t-test in this table. Explain their findings from Table 3, and the possible explanations they give.**

**16. If asked to summarize the findings from this article in just a couple sentences, what would you say? Write 2-3 sentences summarizing the findings from McAdam and Brandt's article.**

15. Why do researchers create confidence intervals? What is the proper interpretation of a confidence interval?

16. What happens to a confidence interval when researchers want more certainty? What happens to a confidence interval when the sample size increases? Why?

17. Of the 50 students in an Introduction to Sociology class, 9 earn As, 21 earn Bs, 15 earn Cs, and 5 earn Ds. Construct a frequency table, including the frequency, percentage and cumulative percentage.
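A sketch of how the question-17 frequency table could be computed:

```python
# Frequency table for question 17 (n = 50).
grades = [("A", 9), ("B", 21), ("C", 15), ("D", 5)]
n = sum(count for _, count in grades)

cumulative = 0.0
print("Grade  Freq  Pct  Cum Pct")
for grade, count in grades:
    pct = 100 * count / n          # percentage of all students
    cumulative += pct              # running total of the percentages
    print(f"{grade:>5}  {count:>4}  {pct:>3.0f}  {cumulative:>7.0f}")
```

The percentages come out to 18, 42, 30, and 10, with cumulative percentages 18, 60, 90, and 100.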

The confidence interval is a range of values, centered around the sample statistic. It specifies the degree of confidence that we have that the true population parameter falls within that range.

Here is the correct interpretation of the confidence interval: If you were to repeatedly draw samples, the true population parameter would fall within the confidence intervals 95% of the time (for a 95% confidence interval).
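A small simulation illustrates this interpretation. The population values below are made up; the point is only that, across many repeated samples, roughly 95% of the 95% confidence intervals cover the true population mean:

```python
import random

random.seed(1)
true_mean, sd, n = 3.0, 0.9, 100   # hypothetical population and sample size
se = sd / n ** 0.5                 # standard error (sd treated as known)

trials, covered = 1000, 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    xbar = sum(sample) / n
    # Does the 95% confidence interval around this sample mean cover the truth?
    if xbar - 1.96 * se <= true_mean <= xbar + 1.96 * se:
        covered += 1

print(covered / trials)            # close to 0.95
```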

Top Hat: Give an example of a discrete variable (either ordinal or nominal).

Top Hat: For extremely liberal men, what is the difference between the expected value and the observed value?

Top Hat: If you were testing for a relationship between political party affiliation (Republicans vs. Democrats) and whether or not someone had a college education (Yes vs. No), which test would you use?

Top Hat: If you were testing whether the expected income for college students after graduation was the same across African-American, Latino, Asian and white students, what test would you use?

Top Hat: If you were testing whether the number of hours spent studying each week was the same for students who participated in clubs on campus and those who didn't participate in clubs on campus, which test would you use?

Top Hat: If you were testing whether the average number of times a student went to Mass annually at Georgetown was greater than 10, which test would you use?

Top Hat: Give an example of two variables that are uncorrelated. (Remember: Both variables must be continuous.)

Top Hat: If I want to predict the amount of crime in a neighborhood (e.g., the number of robberies per 1,000 households), give me an example of an independent variable (X) that might be related to the level of crime in a neighborhood (Y).

What is the unit of analysis in this problem?

(Not on Top Hat) If I randomly selected a neighborhood, and knew nothing about the characteristics of that neighborhood, what would be my best guess of the level of crime in the neighborhood?

**Concept: Regression**

Formula: Y = a + bX (the regression of Y on X, with intercept a and slope b)

Top Hat: What is the dependent variable (Y)?

Top Hat: What is the regression coefficient (b)?

Question: How do you think the age that people have their first kid varies between blacks, whites and others? What test will we use to test the null hypothesis?

Question: Do you think the average age that people have their first kids varies by citizenship status (citizen vs. non-citizen)? Why or why not?

Question: Do you expect the level of social trust (people can always, usually, sometimes, or never be trusted) to vary by social class (working class, lower class, middle class, upper class)? Why? What test would you use to test it?

Question: Do you expect feelings about the Bible (word of God, inspired tales, book of fables) to vary by social class? Which test would you use?

educ = a + (b)paeduc

- What is the dependent variable?

- What is X?

- Finish this sentence: "A one-unit change in age is associated with a .... "

- Why isn't the Y-intercept a meaningful statistic to interpret?

- Write out the prediction equation.

- What is the predicted number of hours of television for a person who is 21 years old?
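A prediction equation of this form can be sketched as follows. The intercept and slope below are hypothetical placeholders, not estimates from any real data:

```python
# Hypothetical prediction equation: predicted tvhours = a + b * age.
a, b = 1.5, 0.05       # placeholder intercept and slope (not real estimates)

def predict(x):
    """Predicted value of Y for a given value of X."""
    return a + b * x

# Predicted hours of television for a 21-year-old, under these placeholders.
print(round(predict(21), 2))   # 2.55
```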

18. Explain when an outlier matters for the various measures of central tendency we used in the class, and the various tests that we have explored.

19. What's the difference between descriptive and inferential statistics?

20. What are the null hypotheses for an ANOVA, a t-test and a chi-square test?

21. In the School of Continuing Studies, there are 100 women studying Journalism, 123 women studying Urban Planning, 92 women studying human resources, 55 men studying Journalism, 72 men studying Urban Planning and 44 men studying human resources. Make a cross-tab. What statistical test would you use to test for an association between gender and degree program?
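A sketch of the question-21 cross-tab and its chi-square statistic, computed from the definition chi-square = sum of (O − E)² / E, with expected counts E = (row total × column total) / grand total:

```python
# Cross-tab for question 21: rows are gender, columns are degree program.
cols = ["Journalism", "Urban Planning", "Human Resources"]
observed = [[100, 123, 92],    # women
            [55, 72, 44]]      # men

row_totals = [sum(row) for row in observed]
col_totals = [sum(row[j] for row in observed) for j in range(len(cols))]
grand = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(cols) - 1)   # (rows - 1) * (cols - 1) = 2
print(round(chi_square, 3), df)              # 0.743 2
```

The statistic is small (about 0.74 with 2 degrees of freedom, well below the alpha = 0.05 critical value of 5.991), so this test would not reject the null hypothesis of no association between gender and degree program.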

22. How do you determine critical values for Chi-Square tests, t-tests and ANOVAs?