**Statistics for Social Research**

**Professor McCabe**

**Measurement and the Collection of Data**

**Describing Data and Distributions**

Visualizing Quantitative Data

**Introduction to Probability**

Concept: Observation and the Unit of Analysis

Concept: Variable

Concept: Levels of Measurement

1. Nominal - Categories, Unranked

2. Ordinal - Categories, Ranked

3. Interval - Continuous, no true zero

4. Ratio - Continuous, true zero

A. Population of Towns in Pennsylvania

B. Religious Denominations

(Christian, Jewish, Muslim, Buddhist)

C. Grade in High School

(Freshman, Sophomore, Junior, Senior)

D. Score on an IQ test

A. Genres of movies

B. Average temperature in DC in August

C. Number of students in each major at Georgetown

D. Ways to cook your steak

(Well-done, medium-well, medium, etc.)

A. Feeling Thermometer

(between 0 - 100, how do you feel about ... )

B. Positions in the Church Hierarchy

(e.g., Bishop, Archbishop, Pope, etc.)

C. Marital Status

(e.g., Married, Single, Divorced, Widowed)

D. Amount of money in your wallet

**Concept: Three Key Measures of Central Tendency**

Mean

Median

Mode

**Concept: Mean**

**Concept: Median**

**Concept: Mode**

Concept: Collecting Data

Concept: Reliability

Concept: Validity

**Introducing Statistics**

Over 3 percent of people in Washington are HIV+.

More than 75 percent of people with HIV are African-American, even though African-Americans account for less than fifty percent of the population of Washington, DC.

In 2012, Washington, DC supported 110,000 tests for HIV, nearly triple the number of tests supported in 2006.

(Statistics from the DC Department of Health)

Several years ago, the team batting average for the Washington Nationals was 0.258.

For players with at least forty at-bats, the batting average ranges from 0.318 (Jayson Werth) to 0.122 (Gio Gonzalez).

Four players have at least 100 hits and a dozen home runs.

(Statistics from the Washington Nationals website)

Statistics refers to more than just the numbers and facts that are collected and reported; it also refers to the set of tools and techniques social scientists use for **collecting**, **describing**, **analyzing**, and **interpreting** data about the world around us.

- How do we collect information on the HIV+ population in DC? (Hint: We don't know the HIV status of every person; instead, we make decisions about how to select a sample, make a claim about the population, etc.)

- How do we decide what baseball statistics to collect, record and make decisions based on? For every statistic we track, there are plenty of others that we don't track, or at least don't use in our decision-making process (e.g., when fans decide who should join All-Star teams, reporters decide who is "hot," or management decides who to sign).

Become thoughtful, careful young social science researchers, able to consider ways of collecting and analyzing data, and using quantitative data to make claims about the social world.

Become critical consumers of statistical information, questioning the source, content and claims of quantitative data from the world around you.

Course Goals:

- Identify ways that social scientists collect, describe and analyze data about the social world;

- Create and critically evaluate visual displays of information, including charts, graphs and other visual tools;

- Explain the importance of sampling for making statistical inferences about broader populations;

- Conduct various statistical tests for evaluating the relationship between variables;

- Differentiate between correlation and causation, recognizing the importance of causal inference to social research, as well as the limitations of generating causal estimates;

- Consume statistical information in your everyday lives with a critical eye toward the source of that data and the legitimacy of the research claims.

Why use statistics?

1. ... to describe events (Descriptive Statistics)

2. ... to make inferences about a population (Inferential Statistics)

3. ... to observe patterns or relationships among variables.

4. ... to predict future events, or quantify the likelihood of a particular event occurring.

Wait! Can I do statistics if I'm not very good at math?

Yes! For this course, I will assume only a basic set of math skills.

Addition, Multiplication, Division

Squares, Exponents and Square Roots

Basic Linear Algebra

Some of the notation may be unfamiliar.

More important than your math skills, I think, are developing your critical thinking and analytical skills.

"There are three kinds of lies -

lies, damned lies, and statistics."

- Mark Twain, quoting British Prime Minister Benjamin Disraeli

Descriptive Statistics

1. Winning times for the Olympics marathon.

2. Racial demographics of the Georgetown student body

3. The percentage of Washington, DC public elementary students passing their standardized exams each year.

Inferential Statistics - Using a sample, or a subset of a larger set of data, to draw inferences about the larger population

1. To test whether women approve of President Obama at higher rates than men.

2. To test whether attitudes on social issues (e.g., abortion, gay marriage or racial profiling) reach a certain threshold.

Concept: Independent vs. Dependent Variable

1. Finding Statistics in Everyday Life

2. Well, what are statistics? (And can I do statistics if I hate math?)

3. Fine, but why would I use statistics?

4. What should I expect to get out of this course?

Professor Smiley

Professor Sweater Vest

Professor Shabby Tie

**Professor Smiley**

5 5 5 5 4

4 4 4 4 4

3 3 3 3 3

3 2 2 2 2

Professor Sweater Vest

5 5 4 4 4

4 4 4 4 4

4 4 4 4 4

4 4 4 3 3

Professor Shabby Tie

5 5 5 5 5

5 5 5 5 5

5 3 3 3 1

1 1 1 1 1

Displaying Data Badly

Question: Can you explain the public option (in the health care debate)?

**Blackjack**

**Roulette**

Probability Question #1:

If I randomly pick one card from a deck, what is the probability of picking either the two of hearts or the Ace of Spades?

Probability Question #2:

a) Assuming the chances of having a boy and having a girl are the same, what are the chances my first child will be a boy?

b) Assuming my first child was a boy, what are the chances my second child will be a boy?

c) Assuming my first and second child are both boys, what are the chances that my third child will be a girl?

Probability Question #3:

Assume that I plan to have three children. Before having any children, what were the chances that my first child was going to be a boy, the second child was going to be a boy, and the third child was going to be a girl?

Now, regardless of the order, what was the probability that I was going to end up with two boys and a girl?
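One way to check the answers to Question #3 is to list every possible birth order and count. The sketch below assumes, as the question does, that boys and girls are equally likely and that births are independent; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Enumerate every equally likely birth order for three children.
outcomes = list(product("BG", repeat=3))  # 8 outcomes, e.g. ('B', 'B', 'G')

# Specific order: boy, then boy, then girl.
p_boy_boy_girl = Fraction(outcomes.count(("B", "B", "G")), len(outcomes))

# Any order: exactly two boys and one girl.
two_boys_one_girl = [o for o in outcomes if o.count("B") == 2]
p_two_boys_one_girl = Fraction(len(two_boys_one_girl), len(outcomes))

print(p_boy_boy_girl)       # 1/8
print(p_two_boys_one_girl)  # 3/8
```

The second probability is three times the first because three different orders (BBG, BGB, GBB) all produce two boys and a girl.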

Probability Question #4:

The World Series is played until a team wins four games (with the maximum number of possible games being seven). Assuming that each team is equally likely to win each game, and that the games are independent events, what is the probability of having a four-game, five-game, six-game and seven-game World Series?

| Length (games) | Theoretical Probability | Expected Number (out of 92) | Actual Number |
|---|---|---|---|
| 4 | 1/8 | 11.5 | 18 |
| 5 | 1/4 | 23.0 | 20 |
| 6 | 5/16 | 28.8 | 20 |
| 7 | 5/16 | 28.8 | 34 |
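The theoretical probabilities for the World Series question can be reproduced with a short calculation. This is a sketch under the question's own assumptions (evenly matched teams, independent games); the function name `series_length_probability` is made up for illustration.

```python
from fractions import Fraction
from math import comb

def series_length_probability(games, wins_needed=4):
    """Probability that an evenly matched series ends in exactly `games` games."""
    # The eventual winner must win the final game and exactly
    # wins_needed - 1 of the earlier games; either team can be the winner.
    ways = 2 * comb(games - 1, wins_needed - 1)
    return ways * Fraction(1, 2) ** games

probabilities = {games: series_length_probability(games) for games in range(4, 8)}
# Expected number of series of each length out of 92 series played.
expected_counts = {games: float(p * 92) for games, p in probabilities.items()}

for games in probabilities:
    print(games, probabilities[games], expected_counts[games])
```

The four probabilities sum to 1, and the expected counts for six and seven games are both 28.75, which rounds to 28.8.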

Polls for the 2012 presidential election showed that the race was within the margin of error.

At one point, President Obama held an eight-point lead among women, but Governor Romney held an eight-point lead among men.

Polling organizations eventually switched from a sample of registered voters to a sample of likely voters.

(Statistics from Gallup, July 30-August 19)

What is the "mark" of a criminal record? Does evidence of a criminal record change the likelihood that job applicants get interviews?

Are homeowners more likely to vote, volunteer or participate in community organizations than renters?

How have individual donors to political campaigns changed over the last thirty years? Are they more partisan? Do they give more money? Do they give to a larger number of candidates?

Are there differences in outcomes (e.g., educational success, behavioral problems, etc.) between children that grow up in stable, two-parent heterosexual households vs. those that grow up in stable, two-parent same-sex households?

How much does class size matter for kindergarten students? Do students perform better on standardized tests when they learn in small classrooms?

When we collect information, each of the individuals or subjects in our research represents a unique **observation**.

In research, we need to think about the unit we are analyzing. At what **level** do we collect, analyze and measure data?

e.g., households vs. individuals (income)

e.g., colleges vs. college students

**Variables** are the characteristics that vary from one observation to another.

Hair color

Eye color

Favorite movie

Worst fear

College GPA

Number of siblings

Annual salary

Top Hat Question: What are some variables that we could measure at the unit of the country? In other words, what is a characteristic of countries that varies from one country to another?

Color of house

Number of occupants

Sales price

Year built

Number of bedrooms

**Dependent variables** are the variables whose variation we are trying to explain.

Why do some students score better on standardized tests than other students in DC public schools? (Standardized test scores)

Why do some people make more money than other people? (Income)

What explains why people have different BMIs (or weights, relative to their heights)? (BMI)

**Independent variables** are the variables used to predict variation in the dependent variable, or those that are related to the dependent variable.

Does a student's race predict their performance on standardized tests? (Race of student)

Do people with more education tend to make more money? (Level of education)

Does your proximity to a local grocery store predict your BMI? (Distance to grocery store)

Discrete

Continuous

Measurement

Good measurement must be ...

1. Reliable.

2. Valid.

3. Exhaustive (Discrete)

4. Mutually Exclusive (Discrete)

**Reliability** refers to the consistency of a measure, whether it produces the same result across time.

**Validity** refers to whether the measurement you use actually gets at the concept you're trying to measure.

Concept: Exhaustive

For discrete (ordinal or nominal) variables, we want to make sure that the response options cover all possible outcomes. When all potential responses are included in one of the categories, we can say that the response options are **exhaustive**.

Concept: Mutually Exclusive

For discrete (ordinal or nominal) variables, we want to make sure that each response fits into one and only one category. In other words, there is no ambiguity about how a response should be coded. When responses fit into one and only one category, we can say that the response options are **mutually exclusive**.

Concept: Measurement Error

Concept: Organizing Data

Concept: Types of Data

How do social scientists collect data about the social world?

Experimental studies vs. Observational studies

Natural experiments

Controlled experiments

Surveys

Administrative Data

Content analysis

1. Cross-sectional data

the Social Capital Community Survey

a survey conducted by GUSA about campus facilities

A Survey Monkey survey you completed.

2. Repeated cross-sections

the General Social Survey

the American National Election Studies

3. Longitudinal (or Panel) data

the Panel Study of Income Dynamics (PSID)

annual World Bank country indicators

When we organize data for statistical analysis, we typically organize the **observations in rows** and the **variables in columns**.

Statistical programs, including Minitab, Excel, SPSS and Stata, should make this organization intuitive.

Data (and particularly data collected in a survey) typically comes with a **codebook** that describes the content of the dataset in more detail. The codebook includes information on how the data were collected, the response options for discrete categories, ways the data are coded, information on missing values, etc.

Concept: Frequency Distribution

The **frequency distribution** is a way of understanding all of the observations that share a common property. It displays the **frequency** - or the number of times - that a particular property occurs among observations.

Imagine that in a class, there are eleven seniors, fourteen juniors and one sophomore.

(cc) image by anemoneprojectors on Flickr

Concept: Percents, Proportions, Rates and Ratios

The **proportion** is the number of items in a group relative to the number of items in total. It is expressed in decimal form.

The **percent** is simply the proportion multiplied by 100.

The **ratio** expresses the comparison of one subgroup to another subgroup (rather than one subgroup to the whole).
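To make the three definitions concrete, here is a small sketch using the class described earlier (eleven seniors, fourteen juniors and one sophomore). The variable names are illustrative assumptions.

```python
# Frequency of each class year in a class of 26 students.
class_years = {"senior": 11, "junior": 14, "sophomore": 1}
total = sum(class_years.values())  # 26 students in all

# Proportion: one group relative to the whole, in decimal form.
proportion_seniors = class_years["senior"] / total

# Percent: the proportion multiplied by 100.
percent_seniors = proportion_seniors * 100

# Ratio: one subgroup compared to another subgroup (not to the whole).
ratio_juniors_to_seniors = class_years["junior"] / class_years["senior"]

print(round(proportion_seniors, 3))        # 0.423
print(round(percent_seniors, 1))           # 42.3
print(round(ratio_juniors_to_seniors, 2))  # 1.27
```

A ratio of about 1.27 means there are roughly 127 juniors for every 100 seniors.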

Construct a Frequency Table

(Because the variable is nominal, rather than ordinal, you don't need to include the cumulative frequency or the cumulative percentage.)

Number of Medals Won in the 2012 Olympics by Continent

446 - Europe

238 - Asia

166 - North America

34 - Africa

29 - South America

48 - Oceania
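A frequency table for the medal counts above could be built along these lines. This is only a sketch: the counts come from the list above, while the layout and variable names are assumptions.

```python
# Medal counts by continent, copied from the list above.
medals = {
    "Europe": 446,
    "Asia": 238,
    "North America": 166,
    "Africa": 34,
    "South America": 29,
    "Oceania": 48,
}

total_medals = sum(medals.values())
# Percent = frequency divided by the total, multiplied by 100.
percents = {continent: 100 * count / total_medals for continent, count in medals.items()}

print(f"{'Continent':<15}{'Frequency':>10}{'Percent':>10}")
for continent, count in medals.items():
    print(f"{continent:<15}{count:>10}{percents[continent]:>10.1f}")
print(f"{'Total':<15}{total_medals:>10}{100.0:>10.1f}")
```

Because continent is a nominal variable, the table stops at frequency and percent and leaves out the cumulative columns.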

Concept: Percentiles

The **percentile** is the value of a variable below which a certain percentage of observations fall. For example, the 25th percentile would be the score or value below which one-quarter of scores (on an exam, for example) fall.

Common percentiles include quartiles (25/50/75), quintiles (20/40/60/80) and deciles (10/20/etc.)

Review: In a class of twenty students, scores on the midterm were as follows:

76 78 78 80 82 82 84 86 89 89

90 92 92 92 93 93 94 96 96 98

Using five-point intervals (e.g., 71-75, 76-80, 81-85, etc.), create a frequency table to describe scores on the midterm. Include both the cumulative frequency and the cumulative percentage.
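One possible way to work through this review exercise in code is sketched below. It assumes five-point intervals that start at 76 (the lowest score), which is one reasonable choice rather than the only one.

```python
scores = [76, 78, 78, 80, 82, 82, 84, 86, 89, 89,
          90, 92, 92, 92, 93, 93, 94, 96, 96, 98]

# Assumed five-point intervals, inclusive at both ends.
bins = [(76, 80), (81, 85), (86, 90), (91, 95), (96, 100)]

rows = []
cumulative = 0
for low, high in bins:
    frequency = sum(low <= s <= high for s in scores)
    cumulative += frequency
    cumulative_percent = 100 * cumulative / len(scores)
    rows.append((f"{low}-{high}", frequency, cumulative, cumulative_percent))

print(f"{'Interval':<10}{'Freq':>6}{'Cum. Freq':>11}{'Cum. %':>8}")
for label, frequency, cum_freq, cum_pct in rows:
    print(f"{label:<10}{frequency:>6}{cum_freq:>11}{cum_pct:>8.0f}")
```

The cumulative columns make sense here because the intervals are ordered: the last row must reach 20 students and 100 percent.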

The measures of **central tendency** are used to tell us something about the normal, typical or average score in a distribution of scores.

Concept: Distribution

The distribution tells us about the frequency that scores occur within any dataset. It lays out and clarifies the set of scores from the data. In some cases, like an exam of twenty students, it is easy to see all the numbers in the distribution. In other cases, there may be too many observations to actually see all the values in the distribution.

But what if we looked at the amount of money every individual donated to political candidates in the 2010 election? There would be millions of observations, and writing them all out would be tedious, boring and unnecessary ...

$25

$200

$35

$100

$60

$60

$55

$320

$100

$105

$140

$20

$10

$100

$200

$330

(Yawn, this is boring)

The **mean** is equal to what you typically think of as the **average**. It is equal to the sum of the scores divided by the total number of scores.

76 78 78 80 82 82 84 86 89 89

90 92 92 92 93 93 94 96 96 98

Sum of scores = 1760

Number of observations = 20

Mean = 1760/20 = 88

Notes: The mean can only be used on continuous variables; it can't be used to understand nominal variables. The mean is skewed by outliers. Outliers are scores that are extremely large or extremely small, relative to the rest of the distribution.

Imagine that the lowest two scores were 10 and 15, rather than 76 and 78.

Sum of scores = 1631

Number of observations = 20

Mean = 1631/20 = 81.55
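The two calculations above can be checked with a few lines of code. This sketch uses the twenty midterm scores from the example; the variable names are illustrative.

```python
scores = [76, 78, 78, 80, 82, 82, 84, 86, 89, 89,
          90, 92, 92, 92, 93, 93, 94, 96, 96, 98]

# Mean = sum of the scores divided by the number of scores.
mean = sum(scores) / len(scores)

# Replace the two lowest scores (76 and 78) with the outliers 10 and 15.
scores_with_outliers = [10, 15] + scores[2:]
mean_with_outliers = sum(scores_with_outliers) / len(scores_with_outliers)

print(sum(scores), mean)                              # 1760 88.0
print(sum(scores_with_outliers), mean_with_outliers)  # 1631 81.55
```

Changing only two of the twenty scores pulls the mean down by more than six points, which is what it means for the mean to be sensitive to outliers.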

**Concept: Weighted Mean**

When you're combining group means from groups of different sizes, you can't simply average the means! Instead, you need to take a **weighted mean**, weighted by the size of each group.

Group 1: Women

N = 12

Mean Exam Score = 84.5

Group 2: Men

N = 8

Mean Exam Score = 93.25

If I asked you for the mean exam score for the class, you couldn't simply average the two group means.

It is not just (84.5 + 93.25)/2 = 88.875.

Instead, you must weight each mean by the number of observations and take the weighted mean.

The proper calculation is ((84.5*12)+(93.25*8))/20=88
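The difference between the simple average of the two group means and the weighted mean can be seen in a short sketch. The group sizes and means are taken from the example above; the tuple layout is an assumption made for illustration.

```python
# (group name, number of students, mean exam score)
groups = [("Women", 12, 84.5), ("Men", 8, 93.25)]

# Naive approach: average the two group means, ignoring group size.
naive_average = sum(group_mean for _, _, group_mean in groups) / len(groups)

# Weighted mean: weight each group mean by its number of observations.
total_n = sum(n for _, n, _ in groups)
weighted_mean = sum(n * group_mean for _, n, group_mean in groups) / total_n

print(naive_average)  # 88.875
print(weighted_mean)  # 88.0
```

The weighted mean is lower than the naive average because the larger group (women, N = 12) has the lower mean and counts for more of the class.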

The **median** is the **middle score** in an ordered distribution. It is the score that divides the distribution equally in half.

76 78 78 80 82 82 84 86 89 89 92 92 92 93 93 94 96 96 98

The **mode** is the score that occurs most frequently in the distribution.

76 78 78 80 82 82 84 86 89 89 90 92 92 92 93 93 94 96 96 98

Notes: The median is insensitive to other scores in the distribution. Again, you can't use the median with nominal variables.

76 78 78 80 82 82 84 86 89 89 92 92 92 93 93 94 96 96 98

26 28 28 30 32 32 34 36 39 89 92 92 92 93 93 94 96 96 98

The median is the same, even though the bottom nine scores in the distribution have changed substantially. (What would happen to the mean in this example?)
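The comparison above, including the question about the mean, can be explored with Python's standard `statistics` module. This is a sketch using the two nineteen-score lists shown above.

```python
from statistics import mean, median

original = [76, 78, 78, 80, 82, 82, 84, 86, 89, 89,
            92, 92, 92, 93, 93, 94, 96, 96, 98]
altered = [26, 28, 28, 30, 32, 32, 34, 36, 39, 89,
           92, 92, 92, 93, 93, 94, 96, 96, 98]

# With 19 ordered scores, the median is the 10th score in both lists.
print(median(original), median(altered))                  # 89 89
# The mean, in contrast, drops sharply when the bottom scores change.
print(round(mean(original), 2), round(mean(altered), 2))  # 87.89 64.21
```

The median does not move because the middle score is unchanged, while the mean falls by more than twenty points.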

Notes: The mode can be used to talk about the most frequent score, but it doesn't tell us anything about the scores that occur around that score!

Question: How do we pick between the measures of central tendency? When is the mean the best measure, or when is the mode or median the best?

We often look at course evaluations to determine the "best" professor to take. For each of the following three professors, calculate the mean, median and mode of their course evaluations. Looking at the data, interpret the meaning of these measures of central tendency.
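A sketch of how this exercise could be checked in code follows. The twenty evaluation scores for each professor are copied from the slides in this lecture; the dictionary layout is an assumption, and `multimode` is used because a distribution can have more than one mode.

```python
from statistics import mean, median, multimode

evaluations = {
    "Professor Smiley":       [5, 5, 5, 5, 4, 4, 4, 4, 4, 4,
                               3, 3, 3, 3, 3, 3, 2, 2, 2, 2],
    "Professor Sweater Vest": [5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
                               4, 4, 4, 4, 4, 4, 4, 4, 3, 3],
    "Professor Shabby Tie":   [5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
                               5, 3, 3, 3, 1, 1, 1, 1, 1, 1],
}

summary = {
    name: {
        "mean": mean(scores),
        "median": median(scores),
        "modes": sorted(multimode(scores)),  # every score tied for most frequent
    }
    for name, scores in evaluations.items()
}

for name, stats in summary.items():
    print(name, stats)
```

Professor Smiley and Professor Shabby Tie share a mean of 3.5, yet their medians (3.5 versus 5) and modes differ, which is why a single measure of central tendency can hide how the evaluations are actually distributed.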

Statistics Lectures

Introducing Statistics

Measurement and the Collection of Data

Describing Data and Distributions

Visualizing Quantitative Data

Z-Scores, Probability and the Normal Distribution

**End!**

**Concept: Measures of Dispersion**

(or Measures of Variability)

When we have a distribution of scores, the **measures of dispersion** (or variability) tell us how the scores are spread around the mean (or another measure of central tendency). In doing so, these measures tell us about the shape of the distribution. Instead of describing the typical score, as we do with a measure of central tendency, we are now concerned with describing the way the scores are spread relative to each other.

**Concept: Range**

The **range** is simply the distance between the minimum and the maximum score in a distribution.

Exam 1: 89-80 = 9

Exam 2: 99-72 = 27

**Concept: Deviation**

The **deviation** represents the difference between any particular observation (xi) and the mean. For each observation, the deviation tells us the distance from that observation to the mean. It can be positive (if the score is greater than the mean) or negative (if the score is less than the mean).

**Concept: Variance**

The **variance** is the average of the squared distances from each score to the mean. On its own, the variance is not a particularly useful statistic, but it is an important step along the way.

To calculate the variance ...

1. Calculate the difference between each score and the mean.

2. Square each difference (or deviation). (Note: Squaring them makes all values positive!)

3. Add up the squared differences.

4. Divide by the number of observations.
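The four steps can be followed one at a time in code. The sketch below applies them to the twenty midterm scores used earlier in the lecture (an illustrative choice of data) and divides by the number of observations, as the steps describe; note that some textbooks and software divide by N - 1 when working with a sample.

```python
scores = [76, 78, 78, 80, 82, 82, 84, 86, 89, 89,
          90, 92, 92, 92, 93, 93, 94, 96, 96, 98]
mean = sum(scores) / len(scores)  # 88.0

# Step 1: difference between each score and the mean (the deviation).
deviations = [score - mean for score in scores]

# Step 2: square each deviation (all values become non-negative).
squared_deviations = [d ** 2 for d in deviations]

# Step 3: add up the squared deviations.
sum_of_squares = sum(squared_deviations)

# Step 4: divide by the number of observations.
variance = sum_of_squares / len(scores)

print(sum_of_squares)  # 868.0
print(variance)        # approximately 43.4
```

The deviations themselves always sum to zero, which is why they are squared before being added up.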

**Concept: Standard Deviation**

To calculate the **standard deviation**, simply take the square root of the variance. The standard deviation is the central statistic that tells us how the scores are spread around the mean in a distribution.

Exam 1 - Standard Deviation

Exam 2 - Standard Deviation

The standard deviation is lower for Exam 1 - where the scores were all bunched closer to the mean - than for Exam 2 - where the scores were spread farther away from the mean.

Example: There are two judges, both of whom sentence criminals charged with misdemeanors. While the mean sentence both judges give is the same - 18 months in jail - the standard deviation is very different. One has a very small standard deviation, while the other has a very large standard deviation. What does this mean?

Rules about the standard deviation:

1. Standard deviation is always greater than or equal to zero!

2. Standard deviation is only equal to zero when all the values in a distribution are the same; in other words, when there is no variation in scores!

3. The greater the variability in scores around the mean, the greater the standard deviation.
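These rules can be illustrated with a small sketch based on the two-judges example. The sentence lengths below are hypothetical numbers invented for illustration (the lecture gives only the shared mean of 18 months), and `pstdev` divides by the number of observations, matching the variance steps in this lecture.

```python
from statistics import mean, pstdev

# Hypothetical sentences in months; both judges average 18 months.
judge_a = [17, 18, 18, 18, 19]  # sentences bunched close to the mean
judge_b = [2, 6, 18, 30, 34]    # sentences spread far from the mean
no_variation = [18, 18, 18, 18, 18]

print(mean(judge_a), mean(judge_b))  # 18 18
print(round(pstdev(judge_a), 2))     # 0.63
print(round(pstdev(judge_b), 2))     # 12.65
print(pstdev(no_variation))          # 0.0
```

With the same mean, the judge with the small standard deviation is consistent from case to case, while the judge with the large standard deviation hands out sentences that vary widely.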

**Concept: Normal Distribution**

**Concept: Z-Scores**

(or standardized scores)

**Concept: Skewed Distributions**

Why do we call it standardized? Well, you can't compare apples and oranges ...

ACT vs. SAT

Both college entrance exams.

However, one point on the ACT (ACT point) does not equal one point on the SAT (SAT point).

Different units of measurement.

In order to compare them, you need to standardize the measurements.

A student who scored 2 standard deviations above the mean SAT score got a 1,100 on the exam.

A student who scored 2 standard deviations above the mean ACT score got a 24 on the exam.

The student who scored a 1,100 on the SAT did as well, relative to the average grade, as a student who scored a 24 on the ACT. They are both 2 standard deviations above the mean.

Now that we know about the standard deviation, we can begin to think about standardized scores, or **Z-scores**.

To simplify the concept, you can consider that every score can be represented in two ways - as a **raw score** or as a **standardized score** (Z-score).

The raw score is the score in the original units of measurement.

You weigh 140 lbs.

You scored 85 points on the exam.

You have an IQ of 105.

You have $1,250 in your bank account.

You are 72 inches tall.

These scores are all presented in their original units of measurement - pounds, exam points, IQ points, dollars or inches. Note that these scores tell us nothing about how someone scored **relative to everyone else in the distribution**.

Each of these scores also has a corresponding standardized score, expressed as the number of standard deviations the score falls from the mean score in the distribution.

Your weight is 1.2 sd above the mean.

Your exam score is 1 sd below the mean.

Your IQ is 0 sd from the mean (meaning you have the average IQ score).

Your bank balance is 2 sd below the mean.

Your height is 0.2 sd above the mean.

Note that these scores are all expressed in standard deviation units. We can claim that your height falls much closer to the average height than your bank balance, which is actually pretty far from the average bank balance (even though height is measured in inches and bank balances are measured in dollars)!

Calculating the Z-Score

Example: The average score on the Physics midterm was an 82. Smart kid that you are, you scored a 92. The professor calculated the standard deviation and told you that it was 6. What is your Z-score?

Wait! I've calculated a Z-Score, showing that I scored 1.67 standard deviations above the mean on the Physics exam, but what does that actually mean?

Question: On the next exam, you again score a 92 and the class average is again an 82. However, the standard deviation has changed to 10. What does this mean for the spread of scores on the exam? What does this mean for your score, relative to the other scores?
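The Z-score calculation for both exams can be sketched as follows. The function name `z_score` is an illustrative assumption; the formula is the raw score minus the mean, divided by the standard deviation.

```python
def z_score(raw_score, mean, standard_deviation):
    """Number of standard deviations a raw score falls from the mean."""
    return (raw_score - mean) / standard_deviation

# First exam: score 92, class mean 82, standard deviation 6.
first_exam = z_score(92, 82, 6)
# Next exam: same score and mean, but the standard deviation is now 10.
second_exam = z_score(92, 82, 10)

print(round(first_exam, 2))  # 1.67
print(second_exam)           # 1.0
```

The raw score is identical on both exams, but on the second exam the scores are more spread out, so a 92 sits only one standard deviation above the mean.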

This is the most common curve you will see in statistics. It is called the normal distribution.

About 68 percent of scores fall within 1 sd of the mean!

About 95 percent of scores fall within 2 sd of the mean!

More than 99 percent of scores fall within 3 sd of the mean!

The **normal distribution** - also known as the bell curve - is a symmetrical curve defined by two statistics - the mean (mean = mode = median) and the standard deviation. In the curve, half of all observations fall above the mean and half fall below the mean. Many social phenomena (e.g., intelligence, height, etc.) approximately follow the normal distribution. As you get farther from the mean score, you will find fewer and fewer observations.
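The rules of thumb for the normal distribution can be checked with the `NormalDist` class in Python's standard library. This is a sketch; it uses the standard normal curve (mean 0, standard deviation 1), but the shares are the same for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of observations within k standard deviations of the mean.
share_within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in share_within.items():
    print(f"within {k} sd: {share:.3f}")
# within 1 sd: 0.683
# within 2 sd: 0.954
# within 3 sd: 0.997
```

The rounded figures of 68, 95 and 99 percent are convenient approximations of these values.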

The three rules of thumb!

The GPA Distribution of Students at College X

[Figure: normal curve of GPAs; raw-score axis marked at 2.10, 2.40, 2.70, 3.00, 3.30, 3.60 and 3.90]

What is the average GPA?

What is the standard deviation?

For a student with a GPA of 3.45 (raw score), what is her standardized score?

If a student has a GPA 1 standard deviation below the mean, what is his GPA?

What percentage of GPAs fall between 2.70 and 3.30?

95 percent of GPAs fall between what raw scores?

Would it be common to find a student with a GPA of 2.00 or below?
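One way to work through these questions is sketched below. It assumes, reading from the axis of the figure, that the mean GPA is 3.00 and the standard deviation is 0.30 (the axis marks are 0.30 apart and centered on 3.00); those two values are an inference from the figure rather than something stated outright.

```python
from statistics import NormalDist

# Assumed from the figure's axis: mean 3.00, standard deviation 0.30.
gpa = NormalDist(mu=3.00, sigma=0.30)

# Standardized score for a raw GPA of 3.45.
z_for_345 = (3.45 - gpa.mean) / gpa.stdev

# Raw GPA one standard deviation below the mean.
one_sd_below = gpa.mean - gpa.stdev

# Share of GPAs between 2.70 and 3.30 (within 1 sd of the mean).
share_270_to_330 = gpa.cdf(3.30) - gpa.cdf(2.70)

# Raw scores that bracket roughly 95 percent of GPAs (within 2 sd).
middle_95 = (gpa.mean - 2 * gpa.stdev, gpa.mean + 2 * gpa.stdev)

# Share of students expected to have a GPA of 2.00 or below.
share_at_or_below_200 = gpa.cdf(2.00)

print(round(z_for_345, 2), round(one_sd_below, 2))
print(round(share_270_to_330, 2), [round(x, 2) for x in middle_95])
print(share_at_or_below_200)
```

A GPA of 2.00 is more than three standard deviations below the assumed mean, so under these assumptions it would be very uncommon.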

When scores are normally distributed, the right tail of the distribution is the same length as the left tail of the distribution, and the mean=median=mode.

However, we will sometimes find social phenomena that are not normally distributed. This is because some observations have extremely high or extremely low scores, thereby making it so that the mean, median and the mode are not equal to one another.

Example of a Right Skew (or Positive Skew): In the United States, income is not normally distributed because some people make millions of dollars. What happens to the mean, median and the mode when you have some outliers at the top of the distribution?

Example of a Left Skew (or Negative Skew): On a final exam, a handful of students do really poorly, getting extremely low grades relative to everyone else. What happens to the mean, median and the mode when you have some outliers at the bottom of the distribution?

16 Books

37 Books

Between 2008 and 2013, the graphical representation of books increased by 131%. It more than doubled, from 16 to 37.

However, tuition increased by only 16 percent - from $47,908 to $55,640.

Thus, the visual representation is extremely misleading.
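The mismatch between the graphic and the underlying numbers can be verified with a percent-change calculation. This is a sketch using the figures quoted above; the function name `percent_change` is illustrative.

```python
def percent_change(old_value, new_value):
    """Percent change from old_value to new_value."""
    return 100 * (new_value - old_value) / old_value

# Number of books drawn in the graphic: 16 in 2008, 37 in 2013.
books_change = percent_change(16, 37)
# Actual tuition: $47,908 in 2008, $55,640 in 2013.
tuition_change = percent_change(47908, 55640)

print(round(books_change, 1))    # 131.2
print(round(tuition_change, 1))  # 16.1
```

The picture grows about eight times faster than the quantity it is supposed to represent.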

30 stick figures

13 stick figures.

11.5 stick figures.

18 stick figures.

The ratio of stick figures from 2013 to 2015 (11.5: 30) makes it look like the yield nearly tripled! In fact, the yield went up by only 15%.

Rules for Displaying Data Well

Concept: Bar Graphs

Concept: Line Graphs

A **bar graph** is a visual display of discrete categories (either nominal or ordinal) where the **length of each bar** represents the **percentage or frequency** of a category.

Title: The percent of people (age 12 or older) who report using illicit drugs last month, by type of county.

Source: 2010 National Survey of Drug Use and Health

Source: 2010 National Survey of Drug Use and Health

Number of users (age 12 and older) with dependence or abuse, by drug type.

**Concept: Histograms**

A **histogram** is a visual display for **continuous data (interval/ratio)** where the scores are presented along one axis and the frequency (or percentage) of that score is presented along the other axis. Often, continuous data are recoded into categories before the construction of a histogram (e.g., a continuous GPA may be recoded into intervals of 0.10). Histograms are often used to show the distribution of continuous data.

Average annual count of evicted tenants, by gender and neighborhood racial composition

Source: Desmond 2012

How could you improve the quality of this graph?

Predicted Probability of Trusting Various Social Groups, by Homeownership Status

Source: McCabe 2012

A **line graph** is a visual display of data typically used to track a social phenomenon across time, or some other continuous measure.

Concept: Pie Graphs

Pie charts are good for making representations of Pac-Man, but aren't particularly good for displaying statistical information. The reason is two-fold. First, and most importantly, pie charts (unlike bar charts or histograms) can tell us about the relative size of the categories, but tell us nothing about their frequency. Second, it is often difficult to correctly visualize the relative size of a piece of the pie.

Top Hat: Which Type of Visual Tool Should I Use?

6. I want to show the distribution of GPAs for the students at Georgetown.

Using Statistical Tools to Describe and Visualize Quantitative Data

Open and describe the data (e.g., the number of variables, observations, missing values, etc.).

Sort the data according to particular variables in the data.

Recode continuous measures into discrete measures (e.g., continuous age measure into categorical age measure).

Get basic descriptive statistics (e.g., measures of central tendency, measures of dispersion).

Create data visualizations (bar charts, histograms, line graphs, etc.)

How do you play roulette (and why does that woman look like she's having so much fun)?

**Concept: Probability**

**Probability** refers to the likelihood that a particular outcome will occur over a long sequence of observations. It is equal to the proportion of times we expect a particular event over a large number of trials.

Notation: **P(A)** refers to the probability of event "A" occurring. For example, in the flip of a fair coin, P(Head) = 0.50. In a class of twenty-five students where ten of the students are sophomores, the probability of picking a sophomore when randomly selecting a student is P(sophomore) = 0.40.

**Concept: Probability Rules (or Probability Rules!)**

1. The probability of an event occurring equals the number of successful outcomes divided by the total number of possible outcomes.

P(A) = Number of Successful Outcomes/Number of Total Outcomes

2. The probability of an event occurring always ranges between 0 and 1.

3. Converse Rule: The probability of an event not occurring is equal to 1 minus the probability of that event occurring.

P(not A) = 1-P(A)

4. Addition Rule: If A and B are distinct outcomes with no overlap, then the probability of getting either A or B is equal to the sum of the probabilities of the two outcomes.

P(A or B) = P(A) + P(B)

4a. Adjusting for Joint Occurrence: If A and B can occur together, simply adding their probabilities double-counts the cases where both occur. To correct for this, we subtract out the joint occurrences.

P(A or B) = P(A) + P(B) - P(A and B)

5. Multiplication Rule: The probability of getting a combination of independent events is equal to the product of the probabilities of their separate occurrences.

If A and B are **independent** events, then P(A and B) = P(A) * P(B).

5a. Conditional Probability: If A and B are both possible outcomes, then P(A and B) = P(A) * P(B given A)

If I randomly select one student in this class, what is the probability he or she will have a Georgetown ID?

If I randomly select one student in this class, what is the probability he or she will already have a bachelor's degree?

P(King): The probability of selecting a King from a deck of cards is 0.0769.

P(Not King): The probability of not selecting a King from a deck of cards is 0.9231.
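The probability rules can be tried out on a standard 52-card deck. The sketch below uses exact fractions; apart from the King example above and the two-of-hearts question from earlier in the lecture, the specific events chosen (King or Heart, two Kings in a row without replacement) are illustrative assumptions.

```python
from fractions import Fraction

DECK_SIZE = 52

# Rule 1: successful outcomes divided by total outcomes.
p_king = Fraction(4, DECK_SIZE)

# Rule 3 (converse rule): P(not A) = 1 - P(A).
p_not_king = 1 - p_king

# Rule 4 (addition rule, no overlap): two of hearts OR ace of spades.
p_two_of_hearts_or_ace_of_spades = Fraction(1, DECK_SIZE) + Fraction(1, DECK_SIZE)

# Rule 4a (joint occurrence): King OR Heart; the King of Hearts is counted twice.
p_heart = Fraction(13, DECK_SIZE)
p_king_of_hearts = Fraction(1, DECK_SIZE)
p_king_or_heart = p_king + p_heart - p_king_of_hearts

# Rule 5 (multiplication rule, independent events): two boys in a row.
p_two_boys = Fraction(1, 2) * Fraction(1, 2)

# Rule 5a (conditional probability): two Kings in a row without replacement.
p_second_king_given_first = Fraction(3, DECK_SIZE - 1)
p_two_kings = p_king * p_second_king_given_first

print(round(float(p_king), 4), round(float(p_not_king), 4))  # 0.0769 0.9231
print(p_two_of_hearts_or_ace_of_spades, p_king_or_heart)     # 1/26 4/13
print(p_two_boys, p_two_kings)                               # 1/4 1/221
```

Drawing without replacement makes the second draw depend on the first, which is why the conditional form of the multiplication rule is needed for the two-Kings case.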

Concept: Probability Distribution

**Concept: Random Variable**

**Concept: Probability and the Normal Curve**

The **rate** is the frequency of an occurrence, relative to a base number (measured in the 10's, 100's or 1000's, etc.)

Seven cities with the highest murder rate (2010)

(Note: the murder rate is the number of murders **per 100,000 people**)

1. New Orleans - 49.1

2. St. Louis - 40.5

3. Baltimore - 34.8

4. Detroit - 34.5

5. Newark - 32.1

6. Oakland - 22.0

7. Washington, DC - 21.9

Can we measure **grit**?

Concept: Unit of Measurement

What is the unit you're measuring?

For income, you're measuring in dollars.

For standardized test scores, you're measuring in points.

For education, you're measuring in years of schooling.

Nominal

Ordinal

Interval

Ratio

(Special case: Dichotomous/Dummy [yes/no])

Note: It is possible to make continuous variables into discrete categories. For example, you could have a continuous age variable (e.g., 18, 19, 20, 21, 22, 23, 24 etc.) and recode it into a categorical variable (e.g., 18-21, 22-30, 31-40, etc.)

Person A: Weighs 150 lbs!

Weighs himself five times and gets the following scores ...

130, 145, 150, 120, 160

Weighs himself five times and gets the following scores ...

125, 126, 125, 126, 124
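The two sets of weigh-ins can be compared numerically to separate reliability from validity. This sketch assumes the two sets of readings come from two different scales (the slides do not say so explicitly), and the names `scale_1` and `scale_2` are hypothetical labels.

```python
from statistics import mean, pstdev

TRUE_WEIGHT = 150  # Person A's actual weight in lbs

# Assumed: each list holds five readings from a different scale.
scale_1 = [130, 145, 150, 120, 160]
scale_2 = [125, 126, 125, 126, 124]

for name, readings in (("scale_1", scale_1), ("scale_2", scale_2)):
    average = mean(readings)
    spread = pstdev(readings)          # consistency of the readings
    error = average - TRUE_WEIGHT      # how far off the readings are on average
    print(name, average, round(spread, 2), round(error, 1))
```

Under that assumption, the second set of readings is highly consistent (reliable) but sits about 25 lbs below the true weight (not valid), while the first set is closer on average but far less consistent.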

1

2

What makes the Census question about race exhaustive?

Do you think the Census question about race is mutually exclusive?

Measurement is imperfect.

Sometimes the instruments are imperfect (e.g., the scale is slightly off). Sometimes people are misleading about their responses (e.g., they might under-report their weight, or over-report their voting behavior).

While we work to minimize measurement error, there is the possibility for error in all measurement.

1. Imprecise tools.

2. Poorly worded questions, surveys.

3. Interviewer biases.

4. Respondent biases (e.g., social desirability).

5. Coding/processing errors.

Using the last names of students in this class, construct a frequency table.

Final Activity:

How do we measure poverty in the United States?

We often hear accounts of poverty, or the percentage of Americans who live in poverty, but what are the actual indicators used to measure poverty?

"X bar" is the mean!

"Sigma" is to sum everything or to add them all up!

"X i" (or X subscript i) is the ith observation in a dataset

"N" is the total number of observations in that dataset
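
Putting these four symbols together gives the usual formula for the mean. The formula itself did not survive as text here, so this is a reconstruction from the definitions above:

```latex
\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}
```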

10 15

76 78

78 80 82 82 84 86 89

89 90 92 92 92 93 93 94 96 96 98

Outliers!

Outliers are scores that are markedly different from the rest of the scores in the distribution. They distort calculations of central tendency, like the mean.
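
To see the distortion, here is a quick sketch using the scores listed above. It assumes the two lowest scores (10 and 15) are the outliers; that reading is an assumption, since the slide does not mark them in the text.

```python
from statistics import mean, median

scores = [10, 15, 76, 78, 78, 80, 82, 82, 84, 86, 89,
          89, 90, 92, 92, 92, 93, 93, 94, 96, 96, 98]

# Assumption: treat the two lowest scores (10 and 15) as the outliers.
trimmed = [s for s in scores if s > 15]

print(round(mean(scores), 2), median(scores))
print(round(mean(trimmed), 2), median(trimmed))
```

Dropping the two low scores moves the mean from about 81.1 to 88, while the median barely changes (89 to 89.5). The mean is sensitive to outliers; the median is not.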

Looking back at these scores ... In which distribution do the scores typically fall closest to the mean?

Example: Home Prices in DC

To calculate the variance

1. Subtract each observation from the mean (to get the deviation)

2. Square the deviation

3. Add up all of the squared deviations ("Sum of Squares")

4. Divide the Sum of Squares by the number of observations to get the variance.
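
Written as a formula, the four steps above look like this. The symbol \sigma^2 for the variance is an assumption; the steps themselves divide by N, the number of observations. The standard deviation is the square root of the variance.

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}
\qquad
\sigma = \sqrt{\sigma^2}
```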

(Note: These are two normal curves. The peak of each curve is the mean. When the scores are bunched closer to the mean, the standard deviation is small; when the scores are spread wider from the mean, the standard deviation is larger. We will learn about the normal curve soon.)

Two students in two separate Introduction to Sociology courses both score a 90 on their exams. Are all 90s created equal?

If a person in class A got a 90, but the mean was 95, he did below average; if a person in class B got a 90, but the mean was 75, she did well above the average.

Even though the raw scores are the same (because 90=90), the students scored very differently relative to the rest of the students in their class!

Example: Calculate the measures of dispersion.

There are six jobless households. They have received unemployment benefits for the following number of weeks:

9 8 6 4 2 1

On Top Hat, calculate the mean, the range, and the standard deviation for the distribution.

Mean = 5

Range = 8

Standard Deviation = 2.94
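
These answers can be checked with a short sketch that follows the variance steps one at a time. The variable names are made up for illustration. Note that it divides by N (not N - 1), which is what reproduces the 2.94 above.

```python
from math import sqrt

weeks = [9, 8, 6, 4, 2, 1]
n = len(weeks)

mean = sum(weeks) / n                   # 30 / 6
value_range = max(weeks) - min(weeks)   # 9 - 1

deviations = [x - mean for x in weeks]  # step 1: deviation from the mean
squared = [d ** 2 for d in deviations]  # step 2: square each deviation
sum_of_squares = sum(squared)           # step 3: sum of squares
variance = sum_of_squares / n           # step 4: divide by N
std_dev = sqrt(variance)                # square root of the variance

print(mean, value_range, round(std_dev, 2))  # 5.0 8 2.94
```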

The standardized score for a value of X

The difference between X and the mean of X (X-bar)

The standard deviation
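
The three labels above describe the parts of the z-score formula. Reconstructed from those labels (using \sigma for the standard deviation, which is an assumption about notation), it reads:

```latex
z = \frac{X - \bar{X}}{\sigma}
```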

Things to remember about the normal curve ...

1. The entire area under the curve is always equal to 100 percent!

2. The peak of the normal curve is the mean of the distribution. In a normal distribution, the mean = mode = median.

3. Half of all observations fall above the mean. Half of all observations fall below the mean.

4. Nearly all scores - more than 99 percent of them! - fall within three standard deviations of the mean (+/- 3 sd). It is very unlikely to find an observation with a score more than 3 sd from the mean.

5. About 95 percent of scores fall within 2 sd of the mean. About 68 percent of scores fall within 1 sd of the mean.
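
The percentages in points 4 and 5 can be checked numerically. This is a minimal sketch using the error function from Python's standard library; the function name `share_within` is made up for illustration.

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```

This prints roughly 68.27, 95.45 and 99.73 percent for 1, 2 and 3 standard deviations.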

What percentage of scores fall between 0 and 1 standard deviation above the mean?

What percentage of scores fall between 0 and 2 standard deviations below the mean?

What percentage of scores fall between 0 and 3 standard deviations above the mean?

**Normal Distribution Table**

When you calculate a z-score, it often won't be a nice, even number (e.g., 1, 2 or 3). You may get a z-score of 1.45, or a z-score of -2.05.

In the Normal Distribution Table, you will find z-scores and the corresponding area under the normal curve for each z-score.
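
If you want to check a table lookup, the same areas can be computed directly. The sketch below assumes the table reports the area between the mean (z = 0) and the z-score, which is how the table values are used elsewhere in these slides. The function name `area_mean_to_z` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (z = 0) and z."""
    return abs(standard_normal.cdf(z) - 0.5)

print(round(area_mean_to_z(1.45), 4))
print(round(area_mean_to_z(-2.05), 4))
```

For the two z-scores mentioned above, this gives about 0.4265 and 0.4798. Because the curve is symmetric, a negative z-score has the same area as its positive counterpart.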

Top Hat:

On the SATs, Person A scores 1.5 standard deviations above the mean. Assuming SAT scores are normally distributed, what percentage of observations fall at or below her score?

Top Hat:

A parent brings her child to get measured (height) and weighed. The child's height is 0.20 standard deviations below the mean. What percentage of children fall within 0.20 standard deviations on either side of the mean?

Top Hat:

The average strawberry weighs 8.5 ounces, with a standard deviation of 1 oz. A worker finds a strawberry that weighs 11 ounces! After calculating the z-score, what percentage of strawberries would weigh more than this one?

[Figure axes: Scores; Frequency (# of times it occurs)]

Here's what you should be able to do by now ...

- Distinguish between levels of measurement (e.g., nominal, ordinal, interval, etc.).

- Talk about the challenges of measurement (e.g., reliability, validity, mutual exclusivity, etc.) in the social sciences.

- Calculate rates, ratios, percentages and proportions.

- Make a frequency table, including recoding continuous variables into discrete measures.

- Calculate the measures of central tendency (mean, median and mode) and talk about their limitations.

- Calculate the measures of dispersion (range, deviation, standard deviation).

- Explain the standard deviation.

- Calculate a z-score and explain the difference between a z-score and a raw score.

- Use the normal distribution table in the back of your book when calculating z-scores.

- Distinguish between types of graphs and visual displays, including when we should use each type to display our data.

Top Hat:

We are studying the distribution of annual income for cashiers at fast food chains. The mean income is $14,000 and the standard deviation is $1,500. Assuming a normal distribution, what percentage of cashiers earn between $14,000 and $16,000?

1. Visualizing Data Badly

2. Creating Good Data Visualizations

(Bar Charts, Line Graphs, Histograms)

3. Crash course on statistical programs

(e.g., Excel, Minitab)

1. Open & Describe the Data

2. Sort the Data

3. Recode Variables

4. Descriptive Statistics

5. Data Visualizations

Opening Data:

FILE --> Open Project

Basic Descriptives:

STAT --> Basic Statistics --> Display Descriptive Statistics

Sorting Data:

DATA --> Sort

Recoding Data:

DATA --> Code

Descriptive Statistics:

CALC --> Column Statistics

Z-Scores:

CALC --> Standardize

Data Visualizations:

GRAPH --> Bar Chart

GRAPH --> Line Chart (or Time Series Plot)

GRAPH --> Histogram

1. Axis scales can be misleading! Don't manipulate the scale in a way that makes people read the figure differently.

2. The visual images should be correctly sized to the data. Otherwise we end up with perceptual distortion (where the image tells a story different from the actual data).

3. Consistent scales & consistent axes! (Units across the axes must be the same.)

4. Enough information, but not too much information. (Avoid data junk.)

1. Select the appropriate type of graph.

2. Clearly label your axes.

3. Ensure consistent scales on the axes. Include a legend (where appropriate).

4. Write titles that identify the information in the chart. (We should be able to "read" the chart without any accompanying text).

5. Avoid perceptual distortions.

6. Minimize data "junk". (This includes excess colors, symbols, and information not directly related to the data story itself.)

7. Remember: Data visualizations are being used to tell a story!

Explain Each Visual Display of Data ...

1. For every year since 1980, I want to graph the acceptance rate at Georgetown.

2. I want to plot the number of murders committed in each of the four quadrants (NW, NE, SW, SE) in DC last year.

3. I would like to show the distribution of SAT scores for students from public and private schools.

4. I want to show voter participation rates for different age groups.

5. I would like to compare the incarceration rates in each state in the United States.

From a deck of cards (52), the probability that I will randomly select a King = P(King) = 4/52 = 0.0769

P(King or Queen) = P(King) + P(Queen) = 4/52 + 4/52 = 0.0769 + 0.0769 = 0.1538.

P(King or Heart) = P(King) + P(Heart) - P(Joint Occurrence) = 4/52 + 13/52 - 1/52 = 16/52 = 0.3077.

P(King and Tails) from one card draw and one coin flip (independent events): P(King) * P(Tails) = 4/52 * 1/2 = 4/104 = 0.0385
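
The card examples above can be reproduced with exact fractions. This is a sketch; the variable names are made up for illustration.

```python
from fractions import Fraction

DECK = 52
p_king = Fraction(4, DECK)
p_queen = Fraction(4, DECK)
p_heart = Fraction(13, DECK)
p_king_of_hearts = Fraction(1, DECK)  # the joint occurrence
p_tails = Fraction(1, 2)

# Addition rule for mutually exclusive events
p_king_or_queen = p_king + p_queen
# Addition rule, adjusted for the joint occurrence
p_king_or_heart = p_king + p_heart - p_king_of_hearts
# Multiplication rule for independent events
p_king_and_tails = p_king * p_tails

results = {
    "P(King)": p_king,
    "P(King or Queen)": p_king_or_queen,
    "P(King or Heart)": p_king_or_heart,
    "P(King and Tails)": p_king_and_tails,
}
for label, p in results.items():
    print(label, p, round(float(p), 4))
```

Working in fractions avoids rounding until the very end, so the decimals match the ones above (0.0769, 0.1538, 0.3077, 0.0385).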

With Replacement vs. Without Replacement

Question: What is the probability of selecting two Kings in a row? Here, replacement matters!

Do you put the card back in the deck before selecting the second card, or do you select from the remaining 51 cards?

P(Ace and Ace) with replacement?

P(Ace and Ace) without replacement?
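
One way to work through the two Ace questions is with the conditional probability rule, P(A and B) = P(A) * P(B given A). The sketch below is illustrative and the variable names are made up.

```python
from fractions import Fraction

aces, deck = 4, 52

# With replacement: the second draw faces the same full deck.
with_replacement = Fraction(aces, deck) * Fraction(aces, deck)

# Without replacement: one Ace and one card are gone before the second draw.
without_replacement = Fraction(aces, deck) * Fraction(aces - 1, deck - 1)

print(with_replacement, round(float(with_replacement), 4))
print(without_replacement, round(float(without_replacement), 4))
```

With replacement the probability is 1/169 (about 0.0059); without replacement it drops to 1/221 (about 0.0045), because P(Ace given Ace) is 3/51 rather than 4/52.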

In a randomized event (e.g., childbirth, flipping coins, etc.), the probability distribution describes the likelihood of each possible outcome of the event. The probability distribution is analogous to the frequency distribution, except that it is based on the number of expected occurrences in the long term (based on probability theory) rather than the actual number of occurrences (as described by empirical evidence).

The normal curve - which you have already seen - is basically an ideal or theoretical model showing the probability that particular events will occur. When we calculated z-scores and looked at the area under the curve, we were figuring out how likely it was that particular events would occur.

This is an application of the rules of probability.

Percentile Rankings: We can also use the area under the normal curve to think about percentiles, knowing the percentage of the population that falls at or below a certain score.

Question: How likely is a score to fall +/- 1 standard deviation from the mean?

P (+/-1sd) = P(-1<z<0) + P(0<z<1) = .3413 + .3413 = .6826

Question: What is the percentile ranking for someone with a score 1 standard deviation above the mean?

P(z<1) = P(z<0) + P(0<z<1) = .5000 + .3413 = .8413 = 84th percentile. That is, 84 percent of scores fall at or below this score.

Question: How likely is a score to fall between 2 and 2.5 standard deviations above the mean?

P(2<z<2.5) = P(0<z<2.5) - P(0<z<2) = 0.4938-0.4772 = .0166

Question: If you scored 1.4 standard deviations above the mean, what is your percentile ranking?

P(z<1.4) = 0.5000 + 0.4192 = .9192

Percentile ranking is the 92nd percentile. You scored at or above 92 percent of people.

If I flip a coin twenty times and it lands on heads 12 times (and tails 8 times) ....

Probability Frequency

Heads 0.50 0.60

Tails 0.50 0.40

**Your turn!**

Questions?

Comments?

Confusions?

Thoughts?

To observe patterns or relationships among variables

1. What is the relationship between how high students rate a course and how easy they think it is?

2. What is the relationship between the number of crimes committed in a neighborhood and the number of bars and restaurants?

3. What is the relationship between whether or not a basketball player made his last free-throw and whether or not he makes his next one?

4. What is the relationship between your parents' income and your future income? (Do you think this varies depending on your race, where you grew up, or the year you were born?)

To predict future events ...

1. Can we use statistics to quantify the likelihood of a political candidate winning, or a basketball team winning?

2. Can we use statistics to quantify the likelihood that it will be sunny, or that it will snow?

Comparison: Bar Chart vs. Pie Chart

**Concept: Odds**

The odds refer to the probability that an event will occur relative to the probability that an event will not occur.

If there are twenty women in the class and ten men, the odds of randomly selecting a woman are 2:1.

If I randomly pick a day of the week, the odds that I do not pick Tuesday are 6:1.
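
The two examples above can be checked by converting a probability into odds. This is a sketch; the function name `odds` is made up for illustration.

```python
from fractions import Fraction

def odds(p):
    """Odds in favor of an event with probability p: p / (1 - p)."""
    p = Fraction(p)
    return p / (1 - p)

odds_woman = odds(Fraction(20, 30))      # 20 women out of 30 students
odds_not_tuesday = odds(Fraction(6, 7))  # 6 of the 7 days are not Tuesday

print(f"{odds_woman.numerator}:{odds_woman.denominator}")              # 2:1
print(f"{odds_not_tuesday.numerator}:{odds_not_tuesday.denominator}")  # 6:1
```

Note the difference from probability: the probability of selecting a woman is 20/30 (about 0.67), while the odds compare 20 women to 10 men.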

Question 1a: What is the probability that the ball will land on black?

Question 1b: What are the odds that the ball will land on black?

Question 2a: What is the probability that the ball will land on 21?

Question 2b: What are the odds that the ball will land on 21?

Question 3a: What is the probability that the ball will land on 28 or 31?

Question 3b: What are the odds that the ball will land on 28 or 31?

Question: Sylvia has an IQ of 95. Herman has an IQ of 101. Assuming that IQ scores are normally distributed, with a mean of 100 and a standard deviation of 15, what percentage of the population has an IQ score between Sylvia and Herman?

z for Sylvia = (95 - 100) / 15 = -0.333; z for Herman = (101 - 100) / 15 = 0.067

P(-0.333<z<0.067) = P(-0.333<z<0) + P(0<z<0.067) = .1293 + .0279 = .1572
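
The same answer can be reached without the table. This sketch uses `NormalDist` from Python's standard library with the mean and standard deviation given in the question; the variable names are made up for illustration.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)
sylvia, herman = 95, 101

# z-score: difference from the mean, divided by the standard deviation
z_sylvia = (sylvia - iq.mean) / iq.stdev
z_herman = (herman - iq.mean) / iq.stdev

# Share of the population between the two scores
share_between = iq.cdf(herman) - iq.cdf(sylvia)

print(round(z_sylvia, 3), round(z_herman, 3), round(share_between, 4))
```

The result is about 0.157, in line with the .1572 from the table. The small difference comes from rounding the z-scores to two decimals before looking them up.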

Top Hat:

How do we measure race?

What about the idea of the "middle class"? What types of characteristics would I look for to determine whether someone was in the middle class?

Back to the middle class example ...

What is a measure of middle class status that accurately gets at the concept we're trying to measure?

Top Hat: Class Survey

Top Hat: Group Project Assignments

You will find a list of fifteen potential research topics and an ordered ranking (1-15) to match. Read through the list, and rank the topics that interest you (with 1 being the most interesting and 15 being the least interesting). These rankings will be the basis of your group project assignments.

Top Hat Review: Percentiles

Top Hat Review: Outliers

Top Hat Question: Describe the distribution of wealth in the United States

Top Hat:

Square Root

What are some of the pitfalls that Wainer (in Displaying Data Badly) describes in his exposition of the rules for displaying quantitative data?

Note: Sociological research on the likelihood of having a third child if your first two children are of the same sex.

Do you think parents are more likely or less likely (or that the likelihood is unchanged) to have a third child when their first two children are of the same sex?

**Concept: Standardized Scores**

- What is the difference between age and cohort? Why do we often study cohorts?

**Z-Scores and the Normal Distribution**

Answer These Questions on Top Hat

Key GSS Findings on Race:

Broad support for measures of racial tolerance.

Despite accepting integration in principle, whites express a strong preference for social distance.

Little appetite for government programs to end inequality and segregation.

Accounts of inequality are more cultural than biological.

Key GSS Findings on Political Attitudes:

Decline in Democratic identification, and a rise in Independent identification.

Ideological polarization increased.

Conservative movement was not rooted in massive shift toward more conservative attitudes or beliefs.

Top Hat:

After receiving a score of 1420 on the SAT, a student learns that she is in the 88th percentile of test-takers. What is her z-score?

Question: What types of variables would you expect to be skewed in the United States? Which types of variables might have a really long left tail, or a really long right tail?

Top Hat:

The average height for women is 64.5 inches (5 ft, 4.5 inches) in the United States. The standard deviation is 3.5 inches. Since height is normally distributed, what percentage of women would you expect to be between 62 and 66 inches tall (5 ft., 2 inches to 5 ft., 6 inches)?

Top Hat:

The average height for men is 70 inches (5 ft, 10 inches) in the United States. The standard deviation is 4 inches. What percentage of men would you expect to be 6 ft. 5 inches or taller?