Statistics for Social Research, Part III

by Brian McCabe, 7 December 2015

Transcript of Statistics for Social Research, Part III

Statistics for Social Research
Professor McCabe

t-tests/difference in means
ANOVA (analysis of variance)
relationships between nominal
variables (e.g., chi squared)
correlation and linear regression
multiple regression
thinking about causality
final review
t-tests/difference in means
ANOVA
(analysis of variance)
review: why we never
"accept" the null hypothesis
Group A: You're testing the null hypothesis that the mean GPA of students at Georgetown is equal to 3.00. You draw a sample of 155 students. Your sample mean is 3.09 and your standard deviation is 0.9. Set alpha equal to 0.05. What is your critical value? Calculate your test statistic. Make a rejection decision.

Group B: You're testing the null hypothesis that the mean GPA of students at Georgetown is equal to 3.00. You draw a sample of 425 students. Your sample mean is 3.09 and your standard deviation is 0.9. Set alpha equal to 0.05. What is your critical value? Calculate your test statistic. Make a rejection decision.
Group A:
Xbar = 3.09
sd = 0.9
n = 155
se = 0.073
Z = 1.245
Z < CV
Fail to reject

Group B:
Xbar = 3.09
sd = 0.9
n = 425
se = 0.044
Z = 2.062
Z > CV
Reject
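A quick way to check both groups' arithmetic (a minimal Python sketch; the course uses Stata, but the numbers are the same):

```python
from math import sqrt

# Two-tailed Z test of H0: mu = 3.00 at alpha = 0.05 (critical value 1.96)
MU0, CV = 3.00, 1.96

for group, n in (("A", 155), ("B", 425)):
    xbar, sd = 3.09, 0.9
    se = sd / sqrt(n)            # standard error of the mean
    z = (xbar - MU0) / se        # test statistic
    decision = "reject" if abs(z) > CV else "fail to reject"
    print(f"Group {group}: se = {se:.3f}, Z = {z:.3f} -> {decision}")
```

Note how the larger sample shrinks the standard error, which is exactly why Group B rejects while Group A fails to reject on the same sample mean.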
Last week: one-sample hypothesis test, asking whether a mean was different from a target value.

This week: two-sample test for the difference in means, asking whether the mean of one group differs from the mean of another group; ANOVA to test for mean differences across multiple groups.
Concept: Bivariate
Relationship
A bivariate relationship is simply the relationship between two (bi) variables (variate).

We are interested in thinking about how levels of one variable (the dependent variable) change across levels of another variable (the independent variable).
This week: How levels of a continuous variable change across levels of a discrete (dichotomous, nominal, ordinal) variable.

Next week: How levels of a discrete variable change across levels of another discrete variable.

The following week: The relationship - or correlation - between two continuous variables.
This week: Does the number of hours spent studying (continuous variable) differ between male and female students (discrete variable)?

Next week: Does voting in a GUSA election (dichotomous - yes/no) differ between freshmen, sophomores, juniors and seniors (ordinal)?

The following week: Is the number of basketball games won by the Georgetown Hoyas each year (continuous) associated with the average GPA of the Georgetown student body (continuous)?
This week, we will be looking at continuous variables (e.g., height, GPA, test scores, rates, etc.) across levels of discrete variables (e.g., race, sex, year in school, dichotomous, etc.)
Concept: Two Sample
Difference in Means
Rules for a Two Sample Difference in Means

1. The variable you're interested in is continuous.

2. Your groups are independent. In other words,
they do not include the same people/subjects.

3. There is equal variance in the populations. We
often assume this to be true, and check the variance
(or standard deviation) in the samples for
confirmation.
Concept: Standard Error
for the Difference in Means
Standard error for the difference in means (pooled, with equal variance assumed):

$$ SE_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} $$

where $s_1^2$ is the variance for the first sample, $n_1 - 1$ is the degrees of freedom for the first sample, and $n_1 + n_2 - 2$ is the total degrees of freedom.
Concept: Steps
for a t-test
Null hypothesis: the mean from the first population equals the mean from the second population, $H_0: \mu_1 = \mu_2$.

For the difference in means where equal variance is assumed, the test statistic is called the t-test.

t-test for the difference in means:

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{SE_{\bar{X}_1 - \bar{X}_2}} $$

where $\bar{X}_1$ is the mean of sample 1, $\bar{X}_2$ is the mean of sample 2, and the denominator is the standard error of the difference in means.
1. State the null hypothesis.
2. State the research hypothesis.
3. Get your sample statistics (Xbar). Calculate the standard error.
4. Calculate your test statistic.
5. Decide on a one-tail or two-tail test.
6. Decide on your level of alpha. Determine the critical value.
7. Make a rejection decision.
8. Interpret your results.
Practice Problem
In a random sample of American adults (n=641), researchers wanted to know whether men and women hold different attitudes towards gun control. Survey respondents were asked ten questions about gun control. The answer to each question was coded "1" if the respondent supported stricter gun control laws and "0" if the respondent did not support stricter gun control laws. After aggregating these questions, each respondent ended up with a score between 0 and 10.

The researchers found that in the sub-sample of men (n=324), the mean score was 6.2 and the standard deviation was 1.3. In the sub-sample of women (n=317), the mean score was 6.5 and the standard deviation was 1.4. Can the researchers conclude that there is a gender difference in attitudes toward gun control?
Xbar1 = 6.2
Xbar2 = 6.5
se = 0.10668
t=-2.812
cv=1.96
reject the null
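To check these numbers (a minimal scipy sketch; `ttest_ind_from_stats` pools the two variances when `equal_var=True`, matching the equal-variance rule above):

```python
from scipy.stats import ttest_ind_from_stats

# Men: mean 6.2, sd 1.3, n 324; Women: mean 6.5, sd 1.4, n 317
t, p = ttest_ind_from_stats(mean1=6.2, std1=1.3, nobs1=324,
                            mean2=6.5, std2=1.4, nobs2=317,
                            equal_var=True)  # pooled-variance t-test
print(f"t = {t:.3f}, p = {p:.4f}")  # t is about -2.81; |t| > 1.96, so reject
```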
Concept: Matched
Pair Sampling
In our t-test thus far, we have required independent samples (without overlap between the two samples). Another kind of t-test to test for mean differences involves "matched-pair" sampling. We can think of this as pre/post or before/after sampling on the same group of people.
We run an intervention in freshman dorms to teach students about diversity. I take a sample of freshman and give them a pre-test (before the diversity training) and a post-test (after the diversity training) to determine the effect of the training.
Let $D_i$ be the difference between matched-pair scores, and $\bar{D}$ the mean of the differences between the matched-pair scores. The standard deviation of the differences is

$$ s_D = \sqrt{\frac{\sum (D_i - \bar{D})^2}{n - 1}} $$

Standard error: $SE_{\bar{D}} = s_D / \sqrt{n}$

Test statistic for matched-pair scores: $t = \bar{D} / SE_{\bar{D}}$, with $df = n - 1$.
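A minimal matched-pair sketch with made-up pre/post scores (the data below are hypothetical; `ttest_rel` runs the paired t-test on the differences):

```python
from scipy.stats import ttest_rel

# Hypothetical pre/post diversity-training scores for the same 8 students
pre  = [60, 55, 70, 62, 58, 65, 61, 59]
post = [66, 58, 74, 61, 63, 70, 64, 65]

t, p = ttest_rel(post, pre)  # paired t-test on the differences, df = n - 1
print(f"t = {t:.3f}, p = {p:.4f}")
```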
Researchers would like to measure whether liberals and conservatives report different levels of support for health care reform in the United States. On a scale of 0-100, with 100 being strongly support, they ask a sample of 60 people how much they support health care reform.

In their sample of 25 liberals, they find a mean score of 60 and a standard deviation of 12. In their sample of 35 conservatives, they find a mean score of 49 and a standard deviation of 14.
Xbar1 = 60
Xbar2 = 49
se = 3.52
t = 3.13
df = 58
If alpha = 0.05, cv = 2.021
(note: t-distribution because
n<121)
t>cv
reject null
Concept: ANOVA
We used a t-test to compare the difference in means between two groups (e.g., male/female, students/professors, college graduates/non-graduates). But what happens when we want to compare the means of three or more groups (e.g., freshmen/sophomores/juniors/seniors; Protestants/Catholics/Jews/others)? In those cases, we use an F-ratio calculated through an analysis of variance.
Considered as an extension of the t-test, the null hypothesis can be written as follows: $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$.
In principle, the ANOVA examines two types of variation. First, we are looking at within-group variation: within each group (e.g., Republicans, Democrats and Independents), how much variation is there around the group mean? Second, we are looking at between-group variation: how much does the mean score for each group (e.g., Republicans, Democrats and Independents) vary?
ANOVA compares the amount of variation between categories (e.g., between R, D and I) with the amount of variation within categories (e.g., among R, D and I).
The greater the differences between categories (means) relative to the differences within categories (standard deviations), the more likely we are to reject the null hypothesis.
If the mean score does, in fact, vary across categories, we would expect the sample means between categories to differ substantially, but the dispersion within categories to be relatively small.
Does support for capital punishment (measured on a scale of 1-10) vary across religions?

Is Hoya Pride (on a scale of 1-10) different across years in school?

Does average BMI change across regions of the country?

Is median income different in cities, suburbs and rural areas?

Do Democrats, Independents and Republicans vary in their score on a political ideology test?
Concept: Sum of
Squares Total (SST)
Concept: Sum of
Squares Between (SSB)
Concept: Sum of
Squares Within (SSW)
SST = the Total Variation of Scores
The SST measures the amount of variation
in the scores, relative to the Grand Mean
(or the mean of the total sample).
SSW = the Total Variation within categories.
The sum of squares within measures the amount
of variation within each of the categories (or how
far the individual scores fall from the group mean).
SSB = the total variation between categories.
The sum of squares between indicates how much
variation there is between the mean of each category.
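In symbols (these are the standard definitions, with $\bar{Y}$ the grand mean, $\bar{Y}_g$ the mean of group $g$, and $n_g$ the size of group $g$):

$$ SST = \sum_i (Y_i - \bar{Y})^2, \qquad SSW = \sum_g \sum_{i \in g} (Y_i - \bar{Y}_g)^2, \qquad SSB = \sum_g n_g (\bar{Y}_g - \bar{Y})^2 $$

and the three always satisfy $SST = SSW + SSB$.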
Concept: Degrees of
Freedom (ANOVA)
dfw = degrees of freedom associated with SSW

Take the total number of observations. Subtract the number of categories: $df_w = N - k$.

dfb = degrees of freedom associated with SSB

Take the total number of categories. Subtract 1: $df_b = k - 1$.
Concept: Mean
Square Estimates (MSE)
The Mean Square Estimates (MSE) are estimates of the population variance, and are calculated by dividing the
sum of squares by the degrees of freedom.
Mean square within: $MSW = SSW / df_w$

Mean square between: $MSB = SSB / df_b$
Concept: F ratio
The F ratio is your test statistic for an ANOVA: $F = MSB / MSW$. It compares the amount of variation between categories to the amount of variation within categories.
(It is the equivalent of your t-test from an analysis of the mean difference between two categories.)
As with our t-tests, the higher the F-ratio, the more likely we are to reject the null hypothesis. In other words, the more variation there is between categories relative to the amount of variation there is within categories, the more likely we are to reject the null hypothesis that the mean score for each of the categories is the same.
Concept: Steps to
Running an ANOVA
1. State your assumptions.
- Independent, random samples
- Continuous outcome
- Populations are normally distributed
- Population variances are equal across groups
2. State your null hypothesis
3. Determine your degrees of freedom.
Decide on alpha. Look at the F-distribution.
Determine your critical value.
4. Calculate your test statistic (F-ratio) using SSW, SSB, MSE, dfw, dfb)
5. Compare your test statistic with the critical value and make a rejection decision.
6. Interpret your decision.
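A minimal sketch with hypothetical scores for three groups (the data are invented; scipy's `f_oneway` returns the same F-ratio you would compute by hand from MSB and MSW):

```python
from scipy.stats import f_oneway

# Hypothetical ideology scores for three made-up groups
democrats    = [3, 2, 4, 3, 4, 3, 3]
independents = [4, 4, 3, 5, 4, 3, 4]
republicans  = [5, 4, 5, 6, 4, 5, 5]

F, p = f_oneway(democrats, independents, republicans)  # F = MSB / MSW
print(f"F = {F:.3f}, p = {p:.4f}")
```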
Review: One sample,
Two samples, Three
(or more) samples ...
1: One sample, compared to a hypothesized mean.
t-test (small sample) or Z-scores.

2: Two groups, comparison of means.
t-test (small samples) or Z-scores.

3: Three or more groups, comparison of means.
F-ratio
1: Is the average GPA for
Georgetown students equal to 3.50?

2: Is the average GPA for Georgetown females
greater than (or different from) the average
GPA for Georgetown males?

3: Does the average GPA differ between Freshmen,
Sophomores, Juniors and Seniors?
1: Is the mean score on a religious tolerance exam (scores: 1-10) greater than 5?

2: Are people who go to church regularly more tolerant than those who don't regularly go to church?

3. Does the level of religious tolerance vary by the frequency of churchgoing (e.g., at least once a week, at least once a month, at least once a year, never)?
chi-square (to test for relationships between nominal variables)
Concept: Discrete
Variables (refresher)
Discrete, or nominal, variables are those measures that fit into categories.

- Religious groups
- Color of the car you drive
- Whether you voted
- Region of the country where your parents live
- Current dorm
Concept:
Independence
A chi square test is used to test the association between nominal variables. Importantly, it is a non-parametric test, meaning that it makes no assumptions about the distribution of the variables (e.g., that they are normally distributed).
Concept:
Bi-variate tables
(or Cross-Tabs)
Before we discuss the chi-square test, we need to consider the construction of a cross-tab. A cross-tab (or cross-tabulation) is simply a bi-variate table showing the relationship between two discrete variables in your data. Bi-variate simply refers to two (bi-) variables (variate).
[Bivariate table from the General Social Survey, with labeled rows, columns, row marginals and column marginals.]
Two variables are said to be independent if the classification of an observation into the category of one variable has no effect on the probability (or likelihood) that the observation will fall into a category of another variable. In other words, knowing something about where an individual falls on one variable (e.g., hair color) tells us nothing about where they are likely to fall on another variable (e.g., did you vote).
Examples of variables that we
might imagine to be independent.

- Gender (M/F) and Metropolitan Status
(e.g., urban, suburban, rural)
- Gun ownership (Yes/No) and Religious
Denomination (e.g., Catholic, Protestant, etc.)
- Majority religion in a country and
landlocked status (Yes/No).
Examples of variables that we might
imagine not to be statistically independent.

- Gender and Romney/Obama
- Age group and voting
- Race and Ward of DC where you live.
- Whether you were in the top quartile of
your high school class and whether you are
in the top quartile of your college class.
- Religion and number of children
Concept:
Chi Square
The chi square test is a test of joint occurrences.
We want to know if the categorization of an observation on one variable (e.g., gender) is independent of the categorization of that observation on another variable (e.g., political ideology).
Expected frequency: the cell frequencies we would expect to find on account of random chance. This is the cell frequency we would expect to find if the variables were fully independent. For each cell,

$$ f_e = \frac{\text{row marginal} \times \text{column marginal}}{N} $$

Observed frequency: the actual frequency observed in the bi-variate table.

Test statistic (chi square obtained):

$$ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} $$
Steps for the
Chi Square Test
1. Test Assumptions:
One population from a random sample
The level of measurement is nominal/ordinal
Expected frequency of each cell >=5
2. State the null hypothesis (that
the variables are independent)
3. Select your level of alpha. Note your
degrees of freedom. Determine your
critical value.
Degrees of Freedom:
df = (r-1)(c-1)
4. Compute the Chi Square test statistic,
using both the expected frequency and
the observed frequency.
5. Make a rejection decision.
6. Interpret your results.
review: confidence intervals,
one-sample t-tests, one-tail vs.
two-tail tests, Type I vs. Type II
errors, critical values.
#1: Do women at Georgetown have higher
GPAs than men at Georgetown?
#2: Do children raised in heterosexual, two-parent
households have better educational outcomes than
children raised in same-sex, two-parent households?
#3: Do poor children watch more hours of TV
each week than middle-class children?
Framework A: Two separate groups,
one point in time.
Framework B: One group of people,
two points in time (i.e., before and
after an intervention); matched-pair
sampling.
#1: Does watching a set of campaign ads
change the amount that voters support
President Obama?
#2: Do diversity programs
increase tolerance and acceptance
among high school seniors?
Rather than testing whether a sample mean is equal to a target number, we are testing whether the mean of
one group (e.g., men) is equal to the mean of another
group (e.g., women).
What is the null hypothesis
for a test of the difference in
means (t-test)?
The formula for the standard error in the difference in means looks more complicated than the formula for the standard error with a single sample ...

However, as you'll see, it requires the same set of inputs -
the sample size (n) and the standard deviation for both samples.
The steps for conducting a t-test for the difference in means between two groups are basically the same as for testing with a single sample (last week).
Stata for two-sample
Difference in Means

An important part of the ANOVA is considering the group mean vs. the grand mean.

The group mean is (quite simply) the mean for each group being compared (e.g., mean of sophomores, mean of juniors, etc.)

The grand mean is the mean of the entire sample (i.e., sophomores, juniors and seniors together).
Concept: Group Mean
vs. Grand Mean
For the entire sample (n=276), mean = 3.89
For our research question, the null hypothesis
is that the mean political ideology score for
Democrats is equal to the mean political ideology
score for Republicans and equal to the mean
political ideology score for Independents.
[Figure: individual political ideology scores with the group means for Democrats, Independents and Republicans plotted against the overall (grand) mean. The total sum of squares measures how far each individual score falls from the grand mean; the within sum of squares measures how far each score falls from its group mean.]

The total sum of squares is made up of two factors: the within sum of squares and the between sum of squares (SST = SSW + SSB).
Practice Problem
Concern: One of the challenges of the ANOVA is that when you reject the null hypothesis (of equal means), you accept an alternative hypothesis that is vague (that the means are not all equal). Typically, you can't just "eyeball" the means to tell which differences are true differences and which result from chance alone. There are no two-tail tests or directional hypotheses.
[F-table: critical values are indexed by df-within and df-between.]
Concept:
Null Hypothesis
Concept: General
Linear Model
The general linear model states that the best prediction of the dependent variable for any particular case is equal to the mean score plus the effect of any independent variable.
If I randomly select one person from the population,
what is my best prediction of her political ideology score (knowing absolutely nothing about her)?
However, if I then learn that she is a Republican, I can
adjust my predicted score to account for the "additive
effect" of being a Republican in my data - essentially,
the difference between the group mean for Republicans
and the grand mean for the whole sample.
This score - the grand mean plus the added effect (+/-)
of being a Republican - is now the best prediction of her
score, but I also have to acknowledge error. There are
other factors (that I don't know) that will cause her
score to deviate from this predicted mean.
The general linear model basically decomposes a
predicted score into three parts: the mean score for the whole sample, the additive effect from particular
categories, and the error term.
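In symbols, one standard way to write this decomposition for person $i$ in group $g$:

$$ Y_{ig} = \underbrace{\bar{Y}}_{\text{grand mean}} + \underbrace{(\bar{Y}_g - \bar{Y})}_{\text{additive group effect}} + \underbrace{e_{ig}}_{\text{error}} $$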
Democrats: Mean Ideology Score = 3.23

Independents: Mean Ideology Score = 3.90

Republicans: Mean Ideology Score = 4.70

Overall Sample: Mean Ideology Score = 3.89

Research Question: Do these differences
represent true population differences, or is
it likely that they are the result of chance alone?
Group Means:
High School - 3.67
Young Adult - 4.42
Middle-aged - 5.58
Retired - 7.92

Grand Mean: 5.40
The best prediction of her
score is the group mean!
These results are from a random sample of respondents to gauge their interest in civic affairs. The scores range from no interest (0) to high levels of interest (10). Researchers wanted to know whether the level of interest in civic affairs varies by age.
So far, we have done hypothesis testing
for continuous variables across discrete
categories ...

One category - one-sample hypothesis test
(e.g., mean age when people have their first
child is equal to 23)

Two categories - two-sample t-test
(e.g., mean age when people have their first child
differs for men and women)

Three or more categories - ANOVA, F-ratio
(e.g., mean age when people have their first child
differs across racial groups - Black, White, Asian,
other)
Today, we will turn to an analysis
of discrete (ordinal/nominal) variables
across a range of categories (ordinal/nominal).
Top Hat:
Write down a pair of discrete variables that you would expect to be statistically independent of one another. In other words, knowing your category on one of the variables tells us nothing about your likelihood of being in a particular category of the other variable.
Top Hat:
What are some pairs of discrete variables
that you would not expect to be statistically
independent of one another? In other words,
name a pair of discrete variables where
knowing your category on one variable
tells us something about what category you
fall into for the second variable.
[Table annotations: each cell shows the joint frequency. Column percentages are calculated by dividing the joint frequency by the column marginal - in other words, the number of extremely liberal women divided by the total number of women.]
The null hypothesis is that two variables (in a bi-variate table) are independent of each other. If the null hypothesis were true, then we would expect the cell frequencies to be the result of random chance alone.
[Grand total: 46,160 - the total number of observations.]
The obtained chi square is then compared to your chi square critical value.
Practice Problem
#1: Researchers want to determine whether
homeowners and renters vary in their support
for stronger gun control laws. They sample
656 people and observe the following frequencies.

- 308 Homeowners favor stronger gun control
- 119 Homeowners oppose stronger gun control
- 175 Renters favor stronger gun control
- 55 Renters oppose stronger gun control

Setting alpha equal to 0.05, determine whether
there is a relationship between homeownership
and support for gun control in America.
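One way to check problem #1 (a scipy sketch; `chi2_contingency` computes the expected frequencies and the chi square statistic from the observed table — `correction=False` matches the by-hand formula above):

```python
from scipy.stats import chi2_contingency

# Rows: homeowners, renters; columns: favor, oppose stronger gun control
observed = [[308, 119],
            [175,  55]]

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
print("expected frequencies:")
print(expected.round(1))
```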
#2: Researchers want to determine whether
support for stronger gun control laws differs
according to which candidate individuals supported
in the 2008 election. They sample
810 people and observe the following frequencies.

- 379 Obama supporters favor stronger gun control
- 194 McCain supporters favor stronger gun control
- 12 supporters of other candidates favor stronger gun control
- 92 Obama supporters oppose stronger gun control
- 123 McCain supporters oppose stronger gun control
- 10 supporters of other candidates oppose stronger gun control

Setting alpha equal to 0.05, determine whether
there is a relationship between which candidate
a person voted for and support for gun control in America.
Top Hat:
In our political data, what is the expected frequency for conservative men?
What is the expected frequency for extremely liberal women?
And what is the observed frequency for conservative men? For extremely liberal women?
Correlation
We often talk about social phenomena that are correlated. When we discuss correlation, we're considering two continuous measures that co-vary - or that vary together. When the value of one variable systematically changes as the value of the second variable changes, we say that the two variables are correlated.
Height, Shoe Size and the
Amount of Money in your Wallet
Concept:
Scatter Plot

Concept: Pearson
Correlation Coefficient

Concept: Direction
Concept: Strength
Concept: Correlation
vs. Causation

Concept: Linear
Relationship

Concept: Curvilinear
Relationship

Concept: Coefficient
of Determination,
r-squared

Concept: Residual
Variance

Concept:
Best Fit Line

A scatter plot is a two-dimensional graph that shows
the coordinates between two variables - X and Y - for
all the observations in a data set. It provides visual evidence to assess whether two variables are correlated.
As the size (measured in carats) of a diamond goes up,
the price goes up. We would say that size and price are
positively correlated.
As reading scores increase, writing scores increase, as well.
We would say that reading scores and writing scores are
positively correlated.
Each dot on the scatter
plot is a different observation
in our data (in this case, each
dot is a different student
in our data)
Scatter plot of height and shoe size.

Scatter plot of height and money in your wallet.
Two continuous variables - X and Y - can
be said to be related in one of two ways:

1. Positive Correlation.
- When the value of X increases, the value
of Y increases.

2. Negative Correlation.
- When the value of X increases, the value
of Y decreases.
Top Hat
: An example of two variables that are
positively correlated? (Remember: Both variables
must be continuous!)
X: Hours studied
Y: Score on an exam

X: Temperature
Y: Number of people going to the beach
for the weekend

X: Number of ice cream cones sold
each day
Y: Number of bottles of water sold
each day
Top Hat:
An example of two variables that
are negatively correlated? (Remember: Both
variables must be continuous.)
X: Number of books read
Y: Hours spent watching TV

X: Hours slept
Y: Time spent socializing
When a change in one variable has no relationship with a change in a second variable, we say that the variables are uncorrelated.

When no correlation exists, a change in X is unrelated to a change in Y.
In addition to noting the direction of a correlation, we can talk about how strong the correlation is.

For example, shoe size and height are very strongly correlated. We can have a pretty good guess about what your shoe size is when we know your height.

Other variables have an association, but the correlation is much weaker. For example, we might know that hours slept is weakly correlated with exam scores. There is a relationship between them, but it is not a particularly powerful one.
For countries around the world, what do you think the relationship is between average life expectancy and mean number of years of schooling?

- Positive correlation?
- Negative correlation?
- Uncorrelated?
How do I know that this is the line that
"best fits" the data?
There are an infinite number of
lines that I could draw through the
data. How do I know which one
is the "best fit" line?
The "best fit" line.
This line is the mean
of years of schooling
(7.2) for the sample.
Regression line
Without the best fit line, our best guess of the mean number of years of schooling would simply be the mean of the sample.

However, the best fit line helps us to more accurately guess the mean years of schooling when we know the life expectancy of a country. Because the variables are correlated, knowing something about X tells us something about Y.
Concept:
Predicting Y

One of the main reasons we look for correlations is that they help us improve our prediction of Y. Without any other information, our best guess of the value of Y for any observation is the mean of the sample.
Note: Pearson's r always ranges from -1 to 1.
The sign indicates whether the variables are positively or negatively correlated.
The value (absolute value) indicates the strength of the correlation.
-1 indicates a perfect negative correlation
1 indicates a perfect positive correlation
0 indicates that variables are uncorrelated

Example: I want to predict the shoe size for a random person. If I know nothing about that person, my best guess of that person's shoe size is the sample mean. However, if I know something about that person's height,
I can make a more accurate prediction. Height and shoe size are correlated.

Example: I want to know how many years of schooling an individual has completed. Knowing nothing else about that person, my best guess for the number of years of schooling completed is the sample mean. However, if I know something about their mother's level of education, I can make a better prediction of their education because your education level and your mother's education level are correlated.

Example: I want to predict the average number of years of schooling completed in country A. If I know nothing else about country A, my best guess of the average number of years of schooling would be the mean for the sample. However, if I know the GDP of the country, I can make a better guess because average years of schooling and GDP are correlated.
The "best fit" line is the line
that minimizes the amount of
error between each observation
and the regression line.

Later, we will talk about
minimizing the sum of the
squared error.

For the moment, suffice it to say
that the "best fit" line is the line
that best reduces the amount
of error between each observation
and the line.
For each observation, the difference between the observed value and the predicted value is the error term.
Practice Problem: Five individuals report the number
of hours of TV they watch and the number of hours
they spend reading. Calculate the correlation coefficient
for hours of TV watched and hours spent reading.

Bill: 5 hours of TV, 11 hours reading
Michelle: 7 hours of TV, 8 hours reading
Anne: 8 hours of TV, 5 hours reading
Hillary: 7 hours of TV, 6 hours reading
George: 3 hours of TV, 10 hours reading
Start with a simple scatter plot. What does the scatter plot tell you about the correlation between hours spent watching TV and hours spent reading?
Calculate the correlation coefficient.
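To check your by-hand calculation (a minimal scipy sketch):

```python
from scipy.stats import pearsonr

tv      = [5, 7, 8, 7, 3]    # hours of TV: Bill, Michelle, Anne, Hillary, George
reading = [11, 8, 5, 6, 10]  # hours spent reading, same order

r, p = pearsonr(tv, reading)
print(f"r = {r:.3f}")
```

Pearson's r here is about -0.83: a strong negative correlation, matching the downward pattern in the scatter plot.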
However, when I know something about a variable
that is correlated with our outcome - in this case, Y
- then I can make a better prediction.
Shoe Size
In our data, let's calculate Pearson's
r for the relationship between
shoe size and height. (Any guesses?)

Now, let's calculate Pearson's r for
the relationship between money
and height. (Any guesses?)
Linear regression is the bread & butter of social science research. If you can master linear regression, you have the basic building block for more advanced topics in quantitative social science.
The regression equation:

$$ Y = a + bX + e, \qquad \hat{Y} = a + bX $$

where $\hat{Y}$ is the predicted value of Y, $a$ is the Y-intercept, $b$ is the slope of the regression line (through the scatter plot), $Y$ is the observed value of Y, and $e$ is the error term (or the difference between the observed value of Y and the predicted value of Y).
The formula for the Y-intercept:

$$ a = \bar{Y} - b\bar{X} $$

where $\bar{Y}$ is the mean of Y, $b$ is the slope of the regression line, and $\bar{X}$ is the mean of X.
The Y-intercept is the place where the regression line crosses the Y-axis. We often refer to this as the constant. We can also think of the Y-intercept as the value of Y when X = 0.
The formula for b, the regression coefficient, is:

$$ b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} $$

The numerator is the covariation of X and Y - a measure of how much X and Y vary together.
The regression coefficient is simply the slope of the line
that runs through your scatter plot. (From previous classes, you may be familiar with the idea of a slope as
rise over run ... )
Interpreting a regression coefficient:

1. Talking about how one variable co-varies alongside another: We typically say a one-unit change in X is associated with a b-unit change in Y.

2. Talking about the predicted value of Y, based on the regression line.
Concept: Plotting
the Regression Line

An example using real data ...
After calculating the Y-intercept (a)
and the regression coefficient (b), we
have enough information to overlay
our regression line atop a scatter plot.
Did you notice a similarity in the formulas for the regression coefficient (b) and the correlation coefficient (r)? The correlation coefficient is basically the same calculation performed on the standardized scores, rather than the raw scores. If we plotted the standardized scores on a scatter plot, our best fit line would have a slope equal to the correlation coefficient.
Mother's Education (Mean): 11.662
Respondent's Education (Mean): 13.907
Regression coefficient, b: 0.437
Each point is an observation from our dataset. The observed value for that point is Y. The predicted value for that point is Ŷ. The difference between the observed value and the predicted value is the error, e.
Concept: Proportional
Reduction in Error (PRE)

Knowing nothing about a particular observation,
we know that our best guess of their level of education
would have been the sample mean. When we know something about that person's mother's level of education, we can use our regression line to make a better prediction about our respondent's level of education. We can make a better prediction of Y (education) knowing something about X (mother's education), but
how much better will our prediction be?
To calculate the PRE, we first make a prediction assuming we know nothing about the independent variable, X.

Then, we make another prediction using the information we know about the independent variable, X.

Quite simply, the PRE tells us the proportional reduction in errors when we know X vs. when we don't know X. How much better did we do in predicting the outcome when we know X than when we didn't know X?

In this case, how much better did we do in predicting respondent's education when we know mother's education, rather than when we only know the sample mean?
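In symbols, with $E_1 = \sum (Y - \bar{Y})^2$ (the errors when predicting with the mean alone) and $E_2 = \sum (Y - \hat{Y})^2$ (the errors when predicting with the regression line):

$$ PRE = \frac{E_1 - E_2}{E_1} $$

For linear regression, this PRE is exactly the coefficient of determination, $r^2$, discussed below.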
Concept:
Outliers

What is an outlier?
How does it affect the
mean of a distribution?
In regression analyses, outliers can "pull"
the regression line up, leading the regression
line to "misfit" the data.
One of the advantages of the scatter plot
is that you can visually see the outliers in
your data. There are tools - called regression
diagnostics - that we can use for evaluating
the presence and the impact of outliers
in our regression analysis.
For starters, we will use a linear regression to talk about
how a change in the level of one continuous variable
is
associated
with a change in the level of another
continuous variable. In doing so, we can predict the
level of our dependent variable (Y) with information
from our independent variable (X).
Today, we will start making a distinction
between our independent and dependent variables.
Our dependent variable - the one we
want to predict - is Y. Our independent variable -
the one we're using to make that prediction - is X.
Examples:

1. If we want to know how mother's level education is associated with the level of respondent's education, then mother's education is X and respondent's education is Y.

2. If we want to know whether the number of violent crimes in a neighborhood is associated with rates of passing exams in neighborhood schools, we are using number of crimes (X) to predict passing rates on exams (Y).

3. If we want to know whether the on-time arrival percentage of airlines predicts the number of customer complaints airlines receive, then the percentage of on-time arrivals is our independent variable (X) and the number of customer complaints is our dependent variable (Y).
Top Hat:
If I want to predict a person's income,
give me an example of a continuous independent
variable (X) that might be related to a person's
income (Y).

What is the unit of analysis in this problem? (Not on
Top Hat)

If I picked a random person, and knew nothing else
about that person, what would be my best guess
of that person's income? (Not on Top Hat.)
Let's start with an easy example ...

X = Percentage of flights that arrive on-time
Y = Number of complaints received

Airline 1: X = 80%, Y=200
Airline 2: X = 40%, Y=210
Airline 3: X = 90%, Y=140
Airline 4: X = 60%, Y=230
Airline 5: X = 70%, Y=130
Concept: Testing
the Significance of
the Pearson's Correlation
Coefficient, r

As with other statistics obtained from a sample, we may want to test whether our correlation coefficient, r, is statistically significant. In other words, does the linear relationship between X and Y truly exist in the population, or is it the result of sampling error?
Null hypothesis: $H_0: \rho = 0$ (no linear relationship in the population).

Test statistic (to be used with critical values in the t-table, with $df = n - 2$):

$$ t = r\sqrt{\frac{n - 2}{1 - r^2}} $$
Calculate the Y-intercept (a)
and the slope of the regression
line (b)
Question: If you knew nothing about the
percentage of on-time arrivals for a particular
airline, what would be your best guess of the
number of customer complaints it received?

Question: If you knew that an airline had an
on-time arrival rate of 75%, what would be
your best guess of the number of customer
complaints it received? (Calculate Y-intercept,
calculate regression coefficient.)
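A sketch answering both questions for the airline data above (`scipy.stats.linregress` returns the intercept a and slope b of the best fit line):

```python
from scipy.stats import linregress

on_time    = [80, 40, 90, 60, 70]       # X: % of flights arriving on time
complaints = [200, 210, 140, 230, 130]  # Y: complaints received

res = linregress(on_time, complaints)
print(f"a (intercept) = {res.intercept:.2f}, b (slope) = {res.slope:.3f}")

# Knowing nothing about X, the best guess is the mean of Y:
print("mean of Y:", sum(complaints) / len(complaints))

# Knowing X = 75, use the prediction equation Y-hat = a + bX:
print("predicted Y at X = 75:", res.intercept + res.slope * 75)
```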
The regression line is also the line that minimizes the sum of the squared error terms.

If you took each error term - the difference between the observed value of Y and the predicted value of Y - and squared them (to make them all positive) and added them up, there is no other line you could draw that would make that value smaller.

In statistics, we often call this Ordinary Least Squares (or OLS) regression.
[Worked answers from the slides: Slope (b) = 1.466, Y-Intercept (a) = 10.90; Slope (b) = 0.24, Y-Intercept (a) = 78.11.]
When we calculate a regression line, it's worth asking,
"How much better are we at predicting Y when we know
X than when we don't know X?"

In other words, do we get a better prediction of respondent's education (Y) when we know something about mother's education (X) than when we don't?

How much better are we at predicting a country's life
expectancy rate (Y) when we know the mean education level (X) rather than when we know nothing about the level?
The coefficient of determination tells us how much of the variation in Y can be "explained" by its relationship to X.

It is, quite simply, the square of Pearson's correlation coefficient, r. We typically refer to the coefficient of determination as "r-squared".
$$ r^2 = \frac{\text{explained variation in } Y}{\text{total variation in } Y} = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} $$

(The explained variation is the part explained by the regression line.)
The observed values do not fall close to
the regression line. The regression line
is not doing a great job explaining the
variation in Y. There is still quite a bit
of error.
The observed values fall quite close to the regression line. The line does a good job of explaining the variation in Y. There is relatively little error in this figure.
Low R-Squared
High R-Squared
Multiple
Regression

Concept: Interpreting the
Regression Coefficients

[Annotated regression output: the Y-intercept (a); the slope (b, also known as your regression coefficient); the coefficient of determination; the test statistics and p-values (the test statistics test whether each coefficient is different from zero); the dependent variable; and the independent variable(s).]
How do we interpret the regression coefficients?
What do they mean?
First, we rarely interpret the Y-Intercept (a).
It's not a very meaningful statistic.
However, the regression coefficient is extremely important. To interpret the regression coefficient,
we usually say: "A one-unit change in X is associated with a b-unit change in Y," where b is the regression coefficient.
"A one-unit increase in mother's education is associated with negative 0.083-unit change in the number of hours of television watched."

Or, less jargony ...

"When the mother's level of education goes up by one year, we would expect the number of hours of television watched to go down by about 0.083 hours."
Concept:
Predicting Y

Sometimes, we'll call the regression equation
the "prediction equation" because it allows us
to make a prediction of Y, given our knowledge
of X.
Take this prediction equation:
Top Hat:
What is the predicted level of education
for someone whose father has fifteen years
of education?
Top Hat:
What is the predicted level of education
for someone whose father has eight years
of education?
Top Hat:
What is the predicted level of education
for someone whose father has zero years
of education?
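The slide's actual prediction equation did not survive the transcript. As a stand-in, here is a hypothetical equation of the same form, Ŷ = a + bX, with invented coefficients, just to show the mechanics of answering the three questions:

```python
# Hypothetical coefficients -- NOT the slide's actual equation
a, b = 10.5, 0.3  # invented Y-intercept and slope

def predicted_education(fathers_education):
    """Prediction equation: Y-hat = a + b * X."""
    return a + b * fathers_education

for x in (15, 8, 0):
    print(f"father's education = {x:2d} -> predicted = {predicted_education(x):.1f}")
```

Note that when X = 0, the prediction is simply the Y-intercept, a.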
When we talk about multiple regression - or multivariate regression - we are simply talking about adding more predictor variables to the equation. Instead of a single
independent variable, we will now have multiple independent variables (multiple Xs).
Why would we want multiple predictor variables?
1. Knowing more information about our observations will help us make better prediction decisions.

When X and Y are correlated, knowing something about X enables us to make better predictions about Y. When there are two independent variables that are correlated with Y, it often helps us make even better predictions. And so on and so on ...
2. The world is messy! All dependent variables are influenced by many things.

For example, if I want to predict your income (Y), there are lots of important independent variables to consider - your level of education, your gender, the prestige of your university, your parents' income, the type of industry where you work, etc.

I could run bivariate regressions for each of these, but as you'll see, it's better to put them into a single prediction equation.
3. We want to know if the relationship we observe in the bivariate framework could, in fact, be explained by the addition of an additional variable.

In other words, we want to know whether there is a direct relationship - X ----> Y - or whether the relationship is spurious - namely, a third variable (Z) is causing the change in both X and Y.
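A minimal multiple-regression sketch with hypothetical data (the numbers are invented; `numpy.linalg.lstsq` fits ordinary least squares, the same estimator behind Stata's `regress`):

```python
import numpy as np

# Hypothetical data: income in $1000s (Y) predicted from years of
# education (X1) and parents' income in $1000s (X2)
X1 = np.array([12, 16, 14, 18, 16, 12, 20, 14])
X2 = np.array([40, 60, 50, 80, 55, 45, 90, 52])
Y  = np.array([35, 55, 45, 70, 52, 38, 85, 48])

# Design matrix with a column of ones for the Y-intercept (a)
X = np.column_stack([np.ones_like(X1), X1, X2])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
a, b1, b2 = coef
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```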
Concept: Partial
Correlation Coefficient
Concept: Partial
Slope Coefficient
The bivariate correlations we've calculated are known as zero-order correlations. The correlation between X and Y is a zero-order correlation.
Partial correlation coefficients take into account the influence of a third variable - Z - when figuring out the correlation.
What is the zero-order correlation between the amount of housework done by the husband and the number of children in the household?
Correlation of Y and X = 0.50
Now, we want to know if the correlation between housework and children is affected by the husband's
years of education.
Partial correlation coefficient (the correlation of X and Y, controlling for Z):

$$ r_{XY.Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{1 - r_{XZ}^2}\,\sqrt{1 - r_{YZ}^2}} $$
Correlation of Y and X = 0.50
Correlation of Y and Z = - 0.30
Correlation of X and Z = - 0.47
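Plugging these values into the formula above:

$$ r_{XY.Z} = \frac{0.50 - (-0.47)(-0.30)}{\sqrt{1 - (-0.47)^2}\,\sqrt{1 - (-0.30)^2}} = \frac{0.359}{(0.883)(0.954)} \approx 0.43 $$

Controlling for the husband's years of education, the housework/children correlation drops from 0.50 to about 0.43.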
Concept: Multiple
Regression Equation
Multiple predictors of our dependent variable.

For example, we might want to know how your level of education (X1), your parents' income (X2) and your college GPA (X3) influence your post-college income (Y).

Bivariate linear regression equation: $\hat{Y} = a + bX$

Multivariate linear regression equation: $\hat{Y} = a + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$
[Diagram: the number of ice cream cones sold and the number of fires in the city are both driven by a third variable - temperature/heat - so the ice cream/fires association is spurious.]
A couple ways to think about partial correlation:

What is the correlation for X and Y, controlling for Z?
What is the association between X and Y when we factor out the influence of Z?
What is the correlation between X and Y independent of the influence of Z?

These are three ways of saying the same thing!
Back to our ice cream cone example, the zero-order correlation between ice cream cones sold and the number of fires would be positive (and strong).

However, controlling for the temperature, we would expect the partial correlation of these two variables to be zero. Factoring out the influence of the temperature, there should be no association between ice cream cones sold and the number of fires in a city.
Top Hat:
Name three independent variables (X) that might be associated with the amount of money an individual has in his or her retirement account (Y)?
Slope coefficient in the bivariate case:

$$ b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} $$

Partial slope coefficients in the multivariate case (when there are 2 independent variables):

$$ b_1 = \frac{s_Y}{s_1}\left(\frac{r_{Y1} - r_{Y2}\, r_{12}}{1 - r_{12}^2}\right), \qquad b_2 = \frac{s_Y}{s_2}\left(\frac{r_{Y2} - r_{Y1}\, r_{12}}{1 - r_{12}^2}\right) $$

Y-intercept in the multivariate case (when there are 2 independent variables):

$$ a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 $$
Correlation of Y and X = 0.50
Correlation of Y and Z = - 0.30
Correlation of X and Z = - 0.47
Standard deviation of Y?
Standard deviation of X?
(Good practice for the final!)
Concept: Interpreting
Multiple Regression Analysis
In the bivariate framework, we talked about the
impact of a one-unit change in X on the predicted
value of Y.
In the multivariate framework, we are beginning to look at the simultaneous impact of multiple predictors on the predicted value of Y.
Top Hat:
Using counties as the unit of analysis,
we want to study the relationship between education (measured by the % of people with a high school degree) and the level of crime (measured by the # of crimes per 10,000 people). Would you expect the number of crimes committed in a county (Y) to go up or down as the percentage of people holding a high school degree (X) increases?
Top Hat:
In one sentence, interpret this prediction equation.

Top Hat:
What is the predicted number of crimes
per 10,000 people in a county where half
of the population holds a high school degree?
Why is that? Why does the regression equation show an increase in the number of crimes as education increases when we would expect the opposite?
The relationship is spurious. (Remember: The ice cream cone/bottle of water example!) There is a third variable correlated with both of these ... but what is it?
[Diagram: Urbanization (% of people living in an urban area) is positively associated with both Education (% of people holding a high school degree) and Crime (# of crimes per capita) - a third variable behind the spurious bivariate relationship.]
Partial Regression Coefficient:
Standard Deviation, Crime = 28.193

Standard Deviation, Education = 8.859

Standard Deviation, Urbanization = 33.969
Interpretation?
Predicted number of crimes per 10,000 people
when 50% of people live in cities and 40% of
people hold a high school degree.
Notice: When we control for the urbanization rate in our
regression analysis, the sign on education flips from positive to negative. Controlling for the level of urbanization in a county, the percentage of high school graduates is negatively related to the number of crimes committed!
How do we interpret the multiple regression coefficients?

Controlling for the level of urbanization in a county, a one percentage-point increase in the number of high school graduates is associated with a decline of 0.58 crimes committed per 10,000 people.
Second example: Using state-level data,
we want to predict the relationship between
the percentage of poor people in a state and the
violent crime rate.

Expectations?
An outlier!
What would happen
to the regression line, the
correlation coefficient (r)
and the R-squared value
if I took that data point
out of the analysis?
Why?
Correlation, Violent Crime & Poor = 0.509
Correlation, Violent Crime & Single-Parent = 0.839
Correlation, Single-Parent & Poor = 0.0549

Standard Deviation, Violent Crime = 441.103
Standard Deviation, Poor = 4.584
Standard Deviation, Single-Parent = 2.121
b = 6.787
Interpretation?
Instead of adding in the percent single-parent, what would have happened if I added a variable into the model that was highly correlated with the violent crime rate (Y) but uncorrelated with the first predictor variable (percent poor)?
The answer - there is no impact on X1 if you include an X2 that is uncorrelated with it!
Concept: Inserting Dummy
Variables into a Regression
Although we introduced regression analysis in the context of continuous measures, we can also put dummy variables - or dichotomous variables - into the model as predictors.
The interpretation of coefficients is analogous to the interpretation with continuous variables. A one-unit change in X - in this case, from "0" to "1" - is associated with a b-unit change in Y.
Dichotomous variables:
Black/white
Male/female
Over 25/under 25
In college/not in college
Concept: Standardized
Regression Coefficients
When we interpret regression coefficients, we interpret them in their original unit of measurement.

e.g., Education measured in years of education
e.g., Income measured in dollars
e.g., GPA measured in GPA points
e.g., Homeownership rate measured in percent homeowners
Why might we want standardized coefficients?
(And what are standardized coefficients?)
Think back to Z-scores ...

We said that every score has a raw score and a standardized score - or a z-score.

e.g., I got an 85 on the exam; my Z-score was 0.57.
e.g., My GPA is 3.20; my Z-score is -1.25

Z-scores allow us to compare across distributions. In this case, Z-scores allow us to compare across regression coefficients to determine which independent variable is a stronger predictor of Y.
Standardized regression coefficient: $\beta = b \cdot \dfrac{s_X}{s_Y}$
Standard Deviation, Crime = 28.193
Standard Deviation, Income = 4.682
Standard Deviation, Education = 8.858
Standard Deviation, Urbanization = 33.969
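As a worked example, using the education coefficient from the earlier crime regression (b of about -0.58) together with the standard deviations listed here:

$$ \beta_{\text{education}} = -0.58 \times \frac{8.858}{28.193} \approx -0.18 $$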
Interpretation: A one standard deviation change in X is associated with a Beta standard deviation change in Y.

Advantage: Compare across independent variables whose underlying units of measurements are different.

Disadvantage: Lacks the intuitive interpretation of unstandardized regression coefficients.
Standardized
Regression
Coefficients
Causation
Inferential statistics: Are the differences observed in a sample the result of true population differences, or are they the result of chance alone (sampling error)?
Association:
We've asked how a change in one variable (X)
is associated with a change in another variable (Y).

e.g., Regression: about correlation; a one-unit change in X is associated with a b-unit change in Y.

e.g., Chi-Square: a test of statistical independence,
whether two discrete variables are related.
An oft-repeated mantra in the social sciences is that correlation does not imply causation.
In other words, just because two variables
are correlated (there is an association) doesn't
mean that a change in one variable causes
a change in another variable.
Think: Ice cream sales and bottles of water sold in the city. We know that they're associated - as the number of ice cream cones sold rises, so too does the number of bottles of water sold. However, increasing the number of ice cream cones sold does not cause the number of bottles of water sold to increase! Correlation does not equal causation.
Three criteria for causality:
1. An association.
2. Time ordering.
X must come before Y.

Example: Boys that are in the Boy Scouts have
lower rates of juvenile delinquency (e.g., arrests)
than those that are not in the Boy Scouts.
Am I less likely to be delinquent because I joined
the Boy Scouts, or did I join the Boy Scouts because
I'm unlikely to be a delinquent?

Here, the temporal ordering - which came first - is
ambiguous.
Example: Students that participate in programs
at the Center for Social Justice (CSJ) are more
likely to advocate for social issues.
Do I participate in CSJ programs because I'm an advocate
for social issues, or do I advocate for social issues because
I participate in CSJ programs?

Again, the temporal ordering is ambiguous.
3. Elimination of
Alternative Explanations
What work have you done to eliminate other explanations of the association?

When we're flying and the fasten seat belt
sign goes on just before it becomes turbulent,
we've satisfied the association requirement
(when the light's on, there's more likely to be
turbulence) and the temporal ordering (the
light came on before the turbulence happened),
but we know that one did not cause the other!
We have a spurious association, or a third variable problem!

If the weather looks turbulent (Z), the pilot is likely to turn on the fasten seat belt sign (X) and there is likely to be turbulence (Y). Therefore, the causal factor is the weather!
When we can measure the third variable,
we can control for it in our multiple regression
model (e.g., Urbanization was the third variable
(Z) that explained the relationship between
education and crime).
Example: I administer the same math test
to all of the kids in a middle school, and I find
a strong association between a student's height (X)
and his/her score on the math exam (Y). This
satisfies both the association and the temporal
ordering criteria ... but have I ruled out all the
alternative explanations? What might my third
variable be?
Age!

In this case, older students tend to be taller, as they've
had more years to grow.

Older students tend to do better at math, as they've had
more years to study.
Sometimes, the spurious (or third) variable is very
difficult to measure.

GPA --------> Lifetime Income

Does a GPA really cause higher incomes?

Or is there a third variable that's difficult to measure that causes both GPA and income to rise?
We often call these variables unobservables.

What are some variables that are difficult to measure?
What variables might we have a hard time observing
in the population?
Besides spurious associations, we sometimes encounter intervening variables.

Here, we find that A causes B, and
B causes C, but we might only observe A and C.

A --------------------------- > C

Education -----------------------> Longer Life Span

Why does more education make people live longer?
A -----------> B -----------> C

Education -----> Increased Income --------> Longer Life Span

What's the reason?
Summary:

Spurious/Third Variable: Age drives both Height and Math Scores, so the height/math association is spurious.

Intervening Variable: Education -> Income -> Life Expectancy.
Experimental Studies: The Gold Standard for Causal Inference

Why are experiments so important for
making causal claims?
Experimental Group vs. Control Group

With randomization, you can basically control for unmeasured differences between the two groups! As a result, researchers can attribute any difference between the groups to the treatment.
Observational Studies:
More typical of social
science research

Example: I want to know whether homeowners
make for better citizens than renters. Do they vote
more often, volunteer in their communities, and
join community groups?
What are some differences (on average) between
homeowners and renters that we can observe?
What are some differences between homeowners
and renters that would be more difficult to observe?
Observed: Income -> Homeownership; Income -> Voting.

Unobserved: Community-Minded -> Homeownership; Community-Minded -> Voting.
Final Question: Why do we care
so much about causality?

[Diagram: Weather looks turbulent -> Pilot turns seat belt sign on; Weather looks turbulent -> Passengers experience turbulence.]

Variables that are hard to observe: how hard you work, how much grit you have, your ability to focus.

Reading "Causal" Headlines
1. What's the difference between a one-sample t-test and a two-sample t-test? Give an example of a research question in which researchers would choose each kind of test.
2. Explain (in words) why researchers would want to run a two-sample t-test.
3. Can you ever know if you've committed a Type I error or a Type II error? Why or why not?
4. Explain the concept of a p-value. When you run a one-sample t-test, what is the p-value telling you?
5. Explain the relationship between your critical value and your p-value.
6. What is the difference between the group mean and the grand mean in an ANOVA?
7. Researchers are going to test for differences in the average weight of preschool children in four different neighborhoods in Washington DC. State the null hypothesis and the type of test you would use.
8. What are the two things that Pearson's r tells you about the relationship between two continuous variables?
9. How do you calculate the expected frequencies required for a Chi-Square test?
10. What does it mean for two variables to be statistically independent of each other when you're running a Chi-Square test?
11. What do researchers mean when they say that there is a linear relationship between X and Y?
12. Explain what "least squares" refers to in an Ordinary Least Squares regression.
13. Give an example of a spurious relationship that does not involve ice cream cones.
14. Explain why researchers never accept a null hypothesis (and instead, either reject or fail to reject).
Let's make a cross-tab for gender & year in this class.

In this class, there are ...
X male sophomores
X male juniors
X male seniors
X female sophomores
X female juniors
X female seniors

The General Social Survey gave
1,395 people the following statement and asked
them, "Do you strongly agree, agree, disagree, or
strongly disagree with the statement."

The statement is: "In the United States traditional divisions between owners and workers still remain. A person's social standing depends upon whether he/she belongs to the upper or lower class." Researchers want
to know whether agreement with this statement varies
between men and women.

Of the men in their sample, they found that 88 strongly agreed, 319 agreed, 162 disagreed, and 30 strongly disagreed. Of the women in their sample, 142 strongly agreed, 416 agreed, 189 disagreed, and 54 strongly disagreed.

They hire your group as research assistants for the project. Based on this information, what would you conclude? Are there differences between men and women on their agreement with the statement about class divisions?
Concept: Statistical Significance
Under the assumption of no effect (in other words, if the null hypothesis is true), the p-value tells you the probability of obtaining a test result that is equal to - or more extreme - than the actual value that you observed.


In hypothesis testing, alpha simply identifies a proportion of the curve (e.g., 0.05) at which we would reject the null hypothesis. In other words, it would be unlikely to find a test statistic in this region if the null hypothesis were true.

For each value of alpha, we can find a Z-score. Think of this as the cut-off point. This is the critical value. For any test statistic that is beyond this critical value, we will claim that our findings are statistically significant.
When reading social research, you will often see stars (*) indicating the level of confidence in the estimates. Typically, * p < 0.10, ** p < 0.05, and *** p < 0.01. Those stars are associated with the level of confidence with which researchers can make their claims (aka, their statistical significance).

In this case, * means that we can reject the null hypothesis at a level of confidence of 90 percent. ** means that we can reject the null hypothesis at a level of confidence of 95 percent. *** means that we can reject the null hypothesis at a level of confidence of 99 percent.

These stars for significance give us a short-hand way of identifying how confident we can be in rejecting the null hypothesis.
Researchers are interested in the topic of abortion as a political issue. They are thinking about differences between Americans who identify as lower- or working-class, and those who identify as middle- or upper-class.

The General Social Survey asked Americans how concerned they are with abortion. On a scale of 1-4 (which we will treat as a continuous variable, with "1" being not concerned at all and "4" being very concerned), they found that the average score for respondents who identified as lower- or working-class (n=292) was 2.016 and the average score for respondents who identified as middle- or upper-class (n=217) was 1.959. Researchers also report that the standard error for the difference in means is 0.03.

Test whether or not lower- and working-class Americans are more concerned with abortion as a political issue than middle- or upper-class Americans.
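Here is a minimal sketch of the calculation in Python (one-tailed, since the question asks whether one group is more concerned than the other; with samples this large, the test statistic is compared against the normal curve):

from scipy.stats import norm

xbar_lower, xbar_middle = 2.016, 1.959   # group means from the problem
se_diff = 0.03                           # reported standard error of the difference

test_stat = (xbar_lower - xbar_middle) / se_diff   # 0.057 / 0.03 = 1.9
critical_value = norm.ppf(0.95)                    # one-tailed, alpha = 0.05 (~1.64)
print(test_stat > critical_value)                  # True -> reject the null hypothesis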



In making rejection decisions about the null hypothesis, we have settled on two ways of making those decisions: comparing the p-value to alpha OR comparing the critical value to your test statistic.
We said that alpha is the level of error you're willing to accept, or the likelihood of making a Type I error.

When drawing random samples, it is possible - but unlikely - that we are going to draw a sample that falls far from the population mean. (Remember, 95% of sample means will fall within +/- 2 standard errors of the population mean. [Technically, they will fall within +/- 1.96 standard errors, but we say 2 because it's easier to remember!])


A Type I error occurs when we reject the null hypothesis, even though the null hypothesis is actually true.
When we conduct non-parametric tests by hand (e.g., Chi-Square), we will compare the test statistic to the critical value to make a rejection decision. The Stata output also includes the p-value, which we can compare to alpha to make our decision.
In a two-tailed test ...
When alpha = 0.05, our critical value = 1.96
When alpha = 0.01, our critical value = 2.58

In a one-tailed test ...
When alpha = 0.05, our critical value = 1.64
When alpha = 0.01, our critical value = 2.33
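These cut-offs come straight from the standard normal curve. A quick way to verify them in Python's scipy:

from scipy.stats import norm

for alpha in (0.05, 0.01):
    two_tail = norm.ppf(1 - alpha / 2)
    one_tail = norm.ppf(1 - alpha)
    print(f"alpha = {alpha}: two-tailed cv = {two_tail:.2f}, one-tailed cv = {one_tail:.2f}")

# alpha = 0.05: two-tailed cv = 1.96, one-tailed cv = 1.64
# alpha = 0.01: two-tailed cv = 2.58, one-tailed cv = 2.33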
Imagine that, in a t-test for the difference in means between two samples using a two-tailed test, you get a test statistic (t) equal to 2.15.

Can you reject the null hypothesis with 95% confidence?
Can you reject the null hypothesis with 99% confidence?
Imagine that, in a chi-square test with four degrees of freedom, you get a test statistic (chi-square) equal to 10.44.

Can you reject the null hypothesis (that the variables are statistically independent) with 95% confidence?
Can you reject the null hypothesis (that the variables are statistically independent) with 99% confidence?
What is the relationship between the correlation coefficient (r) and the slope of the regression line (b)?

It seems that they both tell us something about the relationship between X and Y - two continuous measures.
When we made a scatter plot, we plotted the raw scores - your height in inches, your education in years, the weight of a diamond in carats, etc. When we use the raw scores in a scatter plot, the slope of the regression line (or best-fit line) is b.

However, remember that every score has a raw score AND a standardized score (Z). The standardized score tells us something about the score relative to other scores in the distribution.

If we plotted the standardized scores on the scatter plot, rather than the raw scores, the slope of the regression line (or best fit line) would be r - the correlation coefficient.
As a result, we can think of the coefficient, b, as the unstandardized slope, and the correlation coefficient, r, as the standardized slope.
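A minimal sketch of this fact in Python, using made-up data: regressing the raw scores gives b, while regressing the z-scores gives r.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)            # made-up raw scores
y = 2.0 * x + rng.normal(size=200)

b = np.polyfit(x, y, 1)[0]          # unstandardized slope (raw scores)
r = np.corrcoef(x, y)[0, 1]         # correlation coefficient

zx = (x - x.mean()) / x.std()       # standardized (Z) scores
zy = (y - y.mean()) / y.std()
b_std = np.polyfit(zx, zy, 1)[0]    # slope of the best-fit line through the z-scores

print(np.isclose(b_std, r))         # True: r is the standardized slope
# Equivalently, b = r * (sd_y / sd_x).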
Running a Regression in Stata
Reading Social Research
1. What is the research question? What is the hypothesis that McAdam and Brandt are testing?
2. What is the population McAdam and Brandt are studying? How do they collect their sample? What concerns do they express about a representative sample?

3. Who are the three groups that McAdam & Brandt collect data on? Why might McAdam & Brandt expect different outcomes for each of these three groups? In other words, what might explain differences between these groups?

9. From Table 1, explain the findings for civic attitudes across the three groups in the research. What test did the authors use to test for differences in civic attitudes in Table 1? What was their null hypothesis? Explain what the p-values indicate in this table.

In our projects, we have selected a level of alpha (e.g., alpha = 0.05 or alpha = 0.01) and made a rejection decision based on that level of alpha.

(Recall: Select alpha. Determine a corresponding critical value. Calculate a test statistic. Compare critical value and test statistic (or alpha and a p-value). Make a rejection decision.)

In most social research, we see stars indicating the level of alpha at which we can reject the null hypothesis. It is simply a short-hand way of making rejection decisions.

Remember: If p < alpha, we reject the null hypothesis.

+ p < 0.1
* p < 0.05
** p < 0.01
*** p < 0.001

The stars simply indicate whether the p-value is less than alpha for each test. For example, if your p-value was 0.04, we would indicate that with * (because 0.01 < 0.04 < 0.05). With 95% confidence, we can reject the null hypothesis. If your p-value was 0.007, we would indicate that with ** (because 0.001 < 0.007 < 0.01). At 99% confidence, we can reject the null hypothesis.
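That convention is mechanical enough to write as a small helper; a sketch in Python (the function is my own, just to make the rule concrete):

def stars(p):
    # Map a p-value to the significance stars listed above.
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "+"
    return ""

print(stars(0.04))   # '*'  -> reject the null with 95% confidence
print(stars(0.007))  # '**' -> reject the null with 99% confidence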

11. In Table 2, why do the authors use a chi-square test in the column evaluating "proportion doing service"? For one of the rows, explain the results of the chi-square test.
13. How would you graph the findings in Table 4? Why did you choose that kind of graph? What would your graph show us?
14. Explain the findings in Table 8. What type of test did the authors use? What was their null hypothesis? How would you explain their findings to a lay (non-statistical) audience?
15. What is a logistic regression?
When we talked about Ordinary Least Squares (OLS) regression, our outcome measures (or dependent variables) were continuous.

- hours spent studying
- lifetime income
- score on an exam

In the multivariate framework, we used multiple predictor variables (independent variables) to evaluate changes in the dependent variable.

Our mantra (which, by now, you should have memorized) is that a one-unit change in X is associated with a b-unit change in Y.

But what if our outcome measure is dichotomous and we want to use multiple predictors?

- Voted (yes/no)
- Passed the course (yes/no)
- Accepted to Georgetown (yes/no)

Earlier, we used a chi-square test to determine whether two nominal variables were associated with each other, but we don't (yet) have a framework to understand the joint impact of multiple variables on a dichotomous outcome.

Logistic regression tells us how a change in an independent variable affects the odds of a dichotomous event occurring.

- A 1-unit increase in your high school GPA is associated with an increase in the odds of being accepted to Georgetown.

- Each additional hour spent studying is associated with an increase in the odds of passing the course.

- Holding a college degree is associated with an increase in the odds of voting in an election.

Remember: The odds are simply the probability that an event will occur divided by the probability that an event will not occur.
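As a minimal sketch, here is what a logistic regression looks like in Python's statsmodels, with made-up data on hours studied and passing a course (the variable names and coefficients are hypothetical, not from the article):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
hours = rng.uniform(0, 20, size=300)                        # hypothetical hours studied
true_log_odds = -2.0 + 0.3 * hours                          # assumed underlying model
passed = rng.binomial(1, 1 / (1 + np.exp(-true_log_odds)))  # dichotomous outcome

model = sm.Logit(passed, sm.add_constant(hours)).fit(disp=0)
print(np.exp(model.params[1]))   # the odds ratio: each additional hour multiplies
                                 # the odds of passing by roughly this factor

Exponentiating the coefficient turns the log-odds into an odds ratio: values above 1 mean the odds of the event rise as X rises.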


12. Table 3 reports t-tests for the difference in means between graduates and drop-outs (rows 1 + 2) and graduates and non-matriculants (rows 3 + 4). [Note: This table should probably be broken into two separate tables to clearly indicate the comparisons.] Explain why the authors used a t-test in this table. Explain their findings from Table 3, and the possible explanations they give.
16. If asked to summarize the findings from this article in just a couple sentences, what would you say? Write 2-3 sentences summarizing the findings from McAdam and Brandt's article.
15. Why do researchers create confidence intervals? What is the proper interpretation of a confidence interval?
16. What happens to a confidence interval when researchers want more certainty? What happens to a confidence interval when the sample size increases? Why?
17. Of the 50 students in an Introduction to Sociology class, 9 earn As, 21 earn Bs, 15 earn Cs, and 5 earn Ds. Construct a frequency table, including the frequency, percentage and cumulative percentage.
The confidence interval is a range of values, centered around the sample statistic. It specifies the degree of confidence that we have that the true population parameter falls within that range.
Here is the correct interpretation of the confidence interval: If you were to repeatedly draw samples, the true population parameter would fall within the confidence intervals 95% of the time (for a 95% confidence interval).
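A minimal sketch of building a 95% confidence interval for a mean in Python (the sample statistics below are made up):

import math
from scipy.stats import norm

xbar, sd, n = 2.50, 1.20, 400   # made-up sample mean, standard deviation, and size
se = sd / math.sqrt(n)          # standard error of the mean
z = norm.ppf(0.975)             # 1.96 for 95% confidence
lower, upper = xbar - z * se, xbar + z * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")

# More certainty (say, 99%) means a larger z and a wider interval;
# a larger sample shrinks the standard error and narrows the interval.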
Top Hat: Give an example of a discrete variable (either ordinal or nominal).
Top Hat: For extremely liberal men, what is the difference between the expected value and the observed value?
Top Hat: If you were testing for a relationship between political party affiliation (Republicans vs. Democrats) and whether or not someone had a college education (Yes vs. No), which test would you use?
Top Hat: If you were testing whether the expected income for college students after graduation was the same across African-American, Latino, Asian and white students, what test would you use?
Top Hat: If you were testing whether the number of hours spent studying each week was the same for students who participated in clubs on campus and those who didn't, which test would you use?
Top Hat: If you were testing whether the average number of times a student went to Mass annually at Georgetown was greater than 10, which test would you use?
Top Hat: Give an example of two variables that are uncorrelated. (Remember: both variables must be continuous.)
Top Hat: If I want to predict the amount of crime in a neighborhood (e.g., the number of robberies per 1,000 households), give me an example of an independent variable (X) that might be related to the level of crime in a neighborhood (Y).

What is the unit of analysis in this problem? (Not on Top Hat)

If I randomly selected a neighborhood, and knew nothing about the characteristics of that neighborhood, what would be my best guess of the level of crime in the neighborhood?
Concept: Regression Formula

Top Hat: What is the dependent variable (Y)?
Top Hat: What is the regression coefficient (b)?
Question: How do you think the age at which people have their first kid varies between blacks, whites and others? What test will we use to test the null hypothesis?
Question: Do you think the average age at which people have their first kid varies by citizenship status (citizen vs. non-citizen)? Why or why not?
Question: Do you expect the level of social trust (people can always, usually, sometimes, or never be trusted) to vary by social class (working class, lower class, middle class, upper class)? Why? What test would you use to test it?
Question: Do you expect feelings about the Bible (word of God, inspired tales, book of fables) to vary by social class? Which test would you use?
tvhours = a + (b)age

- What is the dependent variable?
- What is X?
- Finish this sentence: "A one-unit change in age is associated with a .... "
- Why isn't the Y-intercept a meaningful statistic to interpret?
- Write out the prediction equation.
- What is the predicted number of hours of television for a person who is 21 years old?
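Once a and b have been estimated, prediction is just plugging X into the equation. A minimal sketch in Python with hypothetical coefficients (the slide does not report a and b, so the numbers below are made up):

a, b = 1.5, 0.04                  # hypothetical intercept and slope
age = 21
predicted_tvhours = a + b * age   # yhat = a + bX
print(predicted_tvhours)          # 1.5 + 0.04 * 21 = 2.34 hours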
18. Explain when an outlier matters for the various measures of central tendency we have used in class, and for the various tests that we have explored.
19. What's the difference between descriptive and inferential statistics?
20. What are the null hypotheses for an ANOVA, a t-test and a chi-square test?
21. In the School of Continuing Studies, there are 100 women studying Journalism, 123 women studying Urban Planning, 92 women studying Human Resources, 55 men studying Journalism, 72 men studying Urban Planning and 44 men studying Human Resources. Make a cross-tab. What statistical test would you use to test for an association between gender and degree program?
22. How do you determine critical values for Chi-Square tests, t-tests and ANOVAs?