Statistics for Social Research - 1

Statistics Course
by Brian McCabe on 22 September 2014


Statistics for Social Research
Professor McCabe
Measurement and the Collection of Data

Describing Data and Distributions
Visualizing Quantitative Data
Introduction to Probability
Concept: Observation and the Unit of Analysis
Concept: Variable
Concept: Levels of Measurement
1. Nominal - Categories, Unranked

2. Ordinal - Categories, Ranked

3. Interval - Continuous, no true zero

4. Ratio - Continuous, true zero
A. Population of Towns in Pennsylvania

B. Religious Denominations
(Christian, Jewish, Muslim, Buddhist)

C. Grade in High School
(Freshman, Sophomore, Junior, Senior)

D. Score on an IQ test
A. Genres of movies

B. Average temperature in DC in August

C. Number of students in each major
at Georgetown

D. Ways to cook your steak
(Well-done, medium-well, medium, etc.)
A. Feeling Thermometer
(between 0 - 100, how do you feel about ... )

B. Positions in the Church Hierarchy
(e.g., Bishop, Archbishop, Pope, etc.)

C. Marital Status
(e.g., Married, Single, Divorced, Widowed)

D. Amount of money in your wallet
Concept: Three Key Measures of Central Tendency

Mean
Median
Mode
Concept: Mean
Concept: Median
Concept: Mode
Concept: Collecting Data
Concept: Reliability
Concept: Validity
Introducing Statistics
Over 3 percent of people in Washington are HIV+.
More than 75 percent of people with HIV are African-American, even though African-Americans account for less than fifty percent of the population of Washington, DC.
In 2012, Washington, DC supported 110,000 tests for HIV, nearly triple the number of tests supported in 2006.
(Statistics from the DC Department of Health)
The team average batting average for the Washington Nationals last season was 0.258.
For players with at least forty at-bats, the batting average ranges from 0.318 (Jayson Werth) to 0.122 (Gio Gonzalez).
Four players have at least 100 hits and a dozen home runs.
(Statistics from the Washington Nationals website)
Statistics refers to more than just the numbers and facts that are collected and reported; instead, statistics refers to the set of tools and techniques social scientists use for collecting, describing, analyzing, and interpreting data about the world around us.
- How do we collect information on the HIV+ population in DC? (Hint: We don't know the HIV status of every person; instead, we make decisions about how to select a sample, make a claim about the population, etc.)
- How do we decide what baseball statistics to collect, record and make decisions based on? For every statistic we track, there are plenty of others that we don't track, or at least don't use in our decision-making process (e.g., when fans decide who should join All-Star teams, reporters decide who is "hot," or management decides who to sign).
Become thoughtful, careful young social science researchers, able to consider ways of collecting and analyzing data, and using quantitative data to make claims about the social world.

Become critical consumers of statistical information, questioning the source, content and claims of quantitative data from the world around you.
Course Goals:

•Identify ways that social scientists collect, describe and analyze data about the social world;

•Create and critically evaluate visual displays of information, including charts, graphs and other visual tools;

•Explain the importance of sampling for making statistical inferences about broader populations;

•Conduct various statistical tests for evaluating the relationship between variables;

•Differentiate between correlation and causation, recognizing the importance of causal inference to social research, as well as the limitations of generating causal estimates;

•Consume statistical information in your everyday lives with a critical eye toward the source of that data and the legitimacy of the research claims.
Why use statistics?

1. ... to describe events (Descriptive Statistics)

2. ... to make inferences about a population (Inferential Statistics)

3. ... to observe patterns or relationships among variables.

4. ... predict future events, or quantify the likelihood of a particular event occurring.
Wait! Can I do statistics if I'm not very good at math?
Yes! For this course, I will assume only a basic set of math skills.
Addition, Multiplication, Division
Squares, Exponents and Square Roots
Basic Linear Algebra
Some of the notation may be unfamiliar.
More important than your math skills, I think, are developing your critical thinking and analytical skills.
"There are three kinds of lies -
lies, damned lies, and statistics."
- Mark Twain, quoting British Prime
Minister Benjamin Disraeli
Descriptive Statistics

1. Winning times for the Olympics marathon.

2. Racial demographics of the Georgetown student body

3. The percentage of Washington, DC public elementary students passing their standardized exams each year.
Inferential Statistics - Using a sample, or a subset of a larger set of data, to draw inferences about the larger population

1. To test whether women approve of President Obama at higher rates than men.

2. To test whether attitudes on social issues (e.g., abortion, gay marriage or racial profiling) reach a certain threshold.
Concept: Independent vs. Dependent Variable
1. Finding Statistics in Everyday Life
2. Well, what are statistics? (And can I do statistics if I hate math?)
3. Fine, but why would I use statistics?
4. What should I expect to get out of this course?
Professor Smiley
Professor Sweater Vest
Professor Shabby Tie
Professor
Smiley
5 5 5 5 4
4 4 4 4 4
3 3 3 3 3
3 2 2 2 2

Professor
Sweater Vest
5 5 4 4 4
4 4 4 4 4
4 4 4 4 4
4 4 4 3 3
Professor
Shabby Tie
5 5 5 5 5
5 5 5 5 5
5 3 3 3 1
1 1 1 1 1
Displaying Data Badly
Question: Can you explain the public
option (in the health care debate)?
Blackjack
Roulette
Probability Question #1:
If I randomly pick one card from a deck, what is the probability of picking either the two of hearts or the Ace of Spades?
Probability Question #2:
a) Assuming the chances of having a boy or a girl are the same, what are the chances my first child will be a boy?
b) Assuming my first child was a boy, what are the chances my second child will be a boy?
c) Assuming my first and second child are both boys, what are the chances that my third child will be a girl?
Probability Question #3:
Assume that I plan to have three children. Before having any children, what are the chances that my first child will be a boy, the second child will be a boy, and the third child will be a girl? Now, regardless of the order, what is the probability that I will end up with two boys and a girl?
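One way to check your answers to Probability Questions #1 through #3 is to count outcomes directly. The sketch below is only an illustration: it assumes a standard 52-card deck, that boys and girls are equally likely, and that births are independent events. The variable names are made up for this example.

```python
from fractions import Fraction
from itertools import product

# Question 1: two specific cards out of 52, and the two outcomes cannot overlap,
# so the probabilities simply add up.
p_two_of_hearts_or_ace_of_spades = Fraction(1, 52) + Fraction(1, 52)

# Question 2: with independent births, each child is a boy with probability 1/2,
# no matter what came before.
p_boy = Fraction(1, 2)
p_girl = 1 - p_boy

# Question 3, in a fixed order (boy, boy, girl): multiply the three probabilities.
p_boy_boy_girl_in_order = p_boy * p_boy * p_girl

# Question 3, regardless of order: list all eight equally likely families
# and count the ones with exactly two boys.
families = list(product("BG", repeat=3))
two_boys_one_girl = [family for family in families if family.count("B") == 2]
p_two_boys_one_girl = Fraction(len(two_boys_one_girl), len(families))

print(p_two_of_hearts_or_ace_of_spades)  # 1/26
print(p_boy_boy_girl_in_order)           # 1/8
print(p_two_boys_one_girl)               # 3/8
```

Notice that the order matters: one specific sequence has probability 1/8, but three different sequences give two boys and a girl.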
Probability Question #4:
The World Series is played until a team wins four games (with the maximum number of possible games being seven). Assuming that each team is equally likely to win each game, and that the games are independent events, what is the probability of having a four-game, five-game, six-game and seven-game World Series?
Length (games)   Theoretical Probability   Expected Number (out of 92)   Actual Number
4                1/8                       11.5                          18
5                1/4                       23.0                          20
6                5/16                      28.8                          20
7                5/16                      28.8                          34
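The theoretical column can be reproduced with a short counting argument: for a series to end in exactly n games, the winner must win the last game and exactly three of the games before it. The sketch below assumes, as the question does, that each team is equally likely to win each game and that games are independent. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def series_length_probability(n_games, wins_needed=4):
    """Probability that a series between evenly matched teams ends in exactly n_games."""
    # The eventual winner wins the final game and wins_needed - 1 of the earlier games.
    ways = comb(n_games - 1, wins_needed - 1)
    # Either team can be the winner (factor of 2); each specific sequence of
    # n_games results has probability (1/2) ** n_games.
    return 2 * ways * Fraction(1, 2) ** n_games

for games in range(4, 8):
    p = series_length_probability(games)
    # Expected count out of 92 series, rounded to one decimal as in the table.
    print(games, p, round(float(p) * 92, 1))
```

Comparing the expected counts to the actual counts is a first taste of the question we return to later in the course: are the observed differences larger than chance alone would produce?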
Polls for the 2012 presidential election showed that the race was within the margin of error.
At one point, President Obama held an eight-point lead among women, but Governor Romney held an eight-point lead among men.
Polling organizations eventually switched from a sample of registered voters to a sample of likely voters.
(Statistics from Gallup, July 30-August 19)
What is the "mark" of a criminal record? Does evidence of a criminal record change the likelihood that job applicants get interviews?
Are homeowners more likely to vote, volunteer or participate in community organizations than renters?
How have individual donors to political campaigns changed over the last thirty years? Are they more partisan? Do they give more money? Do they give to a larger number of candidates?
Are there differences in outcomes (e.g., educational success, behavioral problems, etc.) between children that grow up in stable, two-parent heterosexual households vs. those that grow up in stable, two-parent same-sex households?
How much does class size matter for kindergarten students? Do students perform better on standardized tests when they learn in small classrooms?
(Available from Hoya Computing)
When we collect information, each of the individuals or subjects in our research represents a unique observation.
In research, we need to think about the unit we are analyzing. At what level do we collect, analyze and measure data?
e.g., households vs. individual (income)
e.g., colleges vs. college students
Variables are the characteristics that vary from one observation to another.
Hair color
Eye color
Favorite movie
Worst fear
College GPA
Number of siblings
Annual salary
Top Hat Question:
What are some variables that
we could measure at the unit of the
country?

In other words, what is a characteristic
of countries that varies from one
country to another?
Color of house
Number of occupants
Sales price
Year built
Number of bedrooms
Dependent variables are the variables whose variation we are trying to explain.

Why do some students score better on standardized tests than other students in DC public schools? (Standardized test scores)
Why do some people make more money than other people? (Income)
What explains why people have different BMIs (or weights, relative to their heights)? (BMI)
Independent variables are the variables used to predict variation in the dependent variable, or those that are related to the dependent variable.

Does a student's race predict their performance on standardized tests? (Race of student)
Do people with more education tend to make more money? (Level of education)
Does your proximity to a local grocery store predict your BMI? (Distance to grocery store)
Discrete
Continuous
Measurement
Good measurement must be ...
1. Reliable.
2. Valid.
3. Exhaustive (Discrete)
4. Mutually Exclusive (Discrete)
Reliability refers to the consistency of a measure, whether it produces the same result across time.
Validity refers to whether the measurement you use actually gets at the concept you're trying to measure.
Concept: Exhaustive
For discrete (ordinal or nominal) variables, we want to make sure that the response options cover all possible outcomes. When all potential responses are included in one of the categories, we can say that the response options are exhaustive.
Concept: Mutually Exclusive
For discrete (ordinal or nominal) variables, we want to make sure that all response options fit into one and only one category. In other words, there is no ambiguity about how a response should be coded. When responses fit into one and only one category, we can say that the response options are mutually exclusive.
Concept: Measurement Error
Concept: Organizing Data
Concept: Types of Data
How do social scientists collect data about the social world?
Experimental studies vs. Observational studies
Natural experiments

Controlled experiments
Surveys

Administrative Data

Content analysis
1. Cross-sectional data
the Social Capital Community Survey
a survey conducted by GUSA about campus facilities
A Survey Monkey survey you completed.
2. Repeated cross-sections
the General Social Survey
the American National Election Survey
3. Longitudinal (or Panel) data
the Panel Study of Income Dynamics (PSID)
annual World Bank country indicators
When we organize data for statistical analysis, we typically organize the observations in rows and the variables in columns.

Statistical programs, including Minitab, Excel, SPSS and Stata, should make this organization intuitive.
Data (and particularly data collected in a survey) typically come with a codebook that describes the content of the dataset in more detail. The codebook includes information on how the data were collected, the response options for discrete categories, ways the data are coded, information on missing values, etc.
Concept: Frequency Distribution
The frequency distribution is a way of understanding all of the observations that share a common property. It displays the frequency - or the number of times - that a particular property occurs among observations.
Imagine that in a class, there are eleven seniors, fourteen juniors and one sophomore.
(cc) image by anemoneprojectors on Flickr
Concept: Percents, Proportions, Rates and Ratios
The proportion is the number of items in a group relative to the number of items in total. It is expressed in decimal form.
The percent is simply the proportion multiplied by 100.
The ratio expresses the comparison of one subgroup to another subgroup (rather than one subgroup to the whole).
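To make the three definitions concrete, here is a small sketch using the class with eleven seniors, fourteen juniors and one sophomore. The variable names are chosen for this illustration only.

```python
class_counts = {"Senior": 11, "Junior": 14, "Sophomore": 1}
total_students = sum(class_counts.values())

# Proportion: one group relative to the whole, in decimal form.
proportion_seniors = class_counts["Senior"] / total_students

# Percent: the proportion multiplied by 100.
percent_seniors = proportion_seniors * 100

# Ratio: one subgroup compared to another subgroup (not to the whole).
ratio_juniors_to_seniors = class_counts["Junior"] / class_counts["Senior"]

print(round(proportion_seniors, 3))
print(round(percent_seniors, 1))
print(round(ratio_juniors_to_seniors, 2))
```

The ratio here says there are roughly 1.27 juniors for every senior; it says nothing about what share of the whole class either group makes up.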
Construct a Frequency Table
(Because the variable is nominal, rather than ordinal, you don't need to include the cumulative frequency or the cumulative percentage.)
Number of Medals Won in the 2012 Olympics by Continent
446 - Europe
238 - Asia
166 - North America
34 - Africa
29 - South America
48 - Oceania
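If you want to check your frequency table, the sketch below builds one from the medal counts listed above. It is a minimal illustration; the function name and the choice to round percentages to one decimal place are assumptions, not part of the exercise.

```python
medals = {
    "Europe": 446,
    "Asia": 238,
    "North America": 166,
    "Africa": 34,
    "South America": 29,
    "Oceania": 48,
}

def frequency_table(counts):
    """Return (category, frequency, percent) rows plus the overall total."""
    total = sum(counts.values())
    rows = []
    for category, frequency in counts.items():
        percent = round(100 * frequency / total, 1)
        rows.append((category, frequency, percent))
    return rows, total

rows, total = frequency_table(medals)
for category, frequency, percent in rows:
    print(f"{category:<15}{frequency:>6}{percent:>8}")
print(f"{'Total':<15}{total:>6}")
```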
Concept: Percentiles
The percentile is the value of a variable below which a certain percentage of observations fall. For example, the 25th percentile would be the score or value below which one-quarter of scores (on an exam, for example) fall.
Common percentiles include quartiles (25/50/75), quintiles (20/40/60/80) and deciles (10/20/etc.)
Review: In a class of twenty students, scores on the midterm were as follows:

76 78 78 80 82 82 84 86 89 89
90 92 92 92 93 93 94 96 96 98

Using five-point intervals (e.g., 71-75, 76-80, 81-85, etc.), create a frequency table to describe scores on the midterm. Include both the cumulative frequency and the cumulative percentage.
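Here is one possible way to work through the review exercise. The sketch assumes the intervals 76-80, 81-85, 86-90, 91-95 and 96-100 (no score falls below 76, so lower intervals are left out); the function name is hypothetical.

```python
scores = [76, 78, 78, 80, 82, 82, 84, 86, 89, 89,
          90, 92, 92, 92, 93, 93, 94, 96, 96, 98]

# Assumed five-point intervals, inclusive on both ends.
intervals = [(76, 80), (81, 85), (86, 90), (91, 95), (96, 100)]

def cumulative_frequency_table(values, bins):
    """Return rows of (interval, frequency, cumulative frequency, cumulative percent)."""
    n = len(values)
    running_total = 0
    table = []
    for low, high in bins:
        frequency = sum(1 for v in values if low <= v <= high)
        running_total += frequency
        table.append((f"{low}-{high}", frequency, running_total, 100 * running_total / n))
    return table

table = cumulative_frequency_table(scores, intervals)
for interval, frequency, cumulative, cumulative_percent in table:
    print(interval, frequency, cumulative, cumulative_percent)
```

The last row should always show a cumulative frequency equal to the number of observations and a cumulative percentage of 100; if it doesn't, the intervals are not exhaustive.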
The measures of central tendency are used to tell us something about the normal, typical or average score in a distribution of scores.
Concept: Distribution
The distribution tells us about the frequency that scores occur within any dataset. It lays out and clarifies the set of scores from the data. In some cases, like an exam of twenty students, it is easy to see all the numbers in the distribution. In other cases, there may be too many observations to actually see all the values in the distribution.
But what if we looked at the amount of money every individual donated to political candidates in the 2010 election? There would be millions of observations, and writing them all out would be tedious, boring and unnecessary ...
$25
$200
$35
$100
$60
$60
$55
$320
$100
$105
$140
$20
$10
$100
$200
$330
(Yawn, this is boring)
The mean is equal to what you typically think of as the average. It is equal to the sum of the scores divided by the total number of scores.
76 78 78 80 82 82 84 86 89 89
90 92 92 92 93 93 94 96 96 98
Sum of scores = 1760
Number of observations = 20
Mean = 1760/20 = 88
Notes: The mean can only be used on continuous variables; it can't be used to understand nominal variables. The mean is skewed by outliers. Outliers are scores that are extremely large or extremely small, relative to the rest of the distribution.
Imagine that the lowest two scores were 10 and 15, rather than 76 and 78.
Sum of scores = 1631
Number of observations = 20
Mean = 1631/20 = 81.55
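The two calculations above can be reproduced in a few lines. This is just a sketch using the twenty midterm scores; the helper function is written out by hand so each step of the definition is visible.

```python
scores = [76, 78, 78, 80, 82, 82, 84, 86, 89, 89,
          90, 92, 92, 92, 93, 93, 94, 96, 96, 98]

def mean(values):
    """Sum of the scores divided by the number of scores."""
    return sum(values) / len(values)

# Replace the two lowest scores (76 and 78) with the outliers 10 and 15.
scores_with_outliers = [10, 15] + scores[2:]

print(sum(scores), mean(scores))
print(sum(scores_with_outliers), mean(scores_with_outliers))
```

Changing just two of the twenty scores pulls the mean down by more than six points, which is what we mean when we say the mean is sensitive to outliers.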
Concept: Weighted Mean
When you're combining group means from groups of different sizes, you can't simply average the means! Instead, you need to take a weighted mean, weighted by the size of each group.
Group 1: Women
N = 12
Mean Exam Score = 84.5

Group 2: Men
N = 8
Mean Exam Score = 93.25
If I asked you the mean exam score for the class, you can't simply average the two scores.
It is not just (84.5 + 93.25)/2=88.875.


Instead, you must weight each mean by the number of observations and take the weighted mean.
The proper calculation is ((84.5*12)+(93.25*8))/20=88
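The same calculation can be written as a small function. This is a sketch, with a hypothetical function name, using the two group means and group sizes from the example.

```python
def weighted_mean(group_means, group_sizes):
    """Weight each group mean by its number of observations."""
    weighted_sum = sum(m * n for m, n in zip(group_means, group_sizes))
    return weighted_sum / sum(group_sizes)

group_means = [84.5, 93.25]   # women, men
group_sizes = [12, 8]

simple_average = sum(group_means) / len(group_means)
class_mean = weighted_mean(group_means, group_sizes)

print(simple_average)  # 88.875
print(class_mean)      # 88.0
```

The simple average overstates the class mean because it gives the smaller group (the men) the same weight as the larger group (the women).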
The median is the middle score in an ordered distribution. It is the score that divides the distribution equally in half.
76 78 78 80 82 82 84 86 89 89 92 92 92 93 93 94 96 96 98
The mode is the score that occurs most frequently in the distribution.
76 78 78 80 82 82 84 86 89 89 90 92 92 92 93 93 94 96 96 98
Notes: The median is insensitive to extreme scores elsewhere in the distribution. Again, you can't use the median with nominal variables.
76 78 78 80 82 82 84 86 89 89 92 92 92 93 93 94 96 96 98
26 28 28 30 32 32 34 36 39 89 92 92 92 93 93 94 96 96 98
The median is the same, even though the bottom nine scores in the distribution have changed substantially. (What would happen to the mean in this example?)
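To answer the question in parentheses, you can compare both measures on the two lists of nineteen scores. The sketch below uses Python's built-in statistics module; the variable names are made up for the example.

```python
from statistics import mean, median

original = [76, 78, 78, 80, 82, 82, 84, 86, 89, 89,
            92, 92, 92, 93, 93, 94, 96, 96, 98]
shifted = [26, 28, 28, 30, 32, 32, 34, 36, 39, 89,
           92, 92, 92, 93, 93, 94, 96, 96, 98]

# With nineteen ordered scores, the median is the tenth score in each list.
print(median(original), median(shifted))

# The mean uses every score, so lowering the bottom nine scores drags it down.
print(round(mean(original), 2), round(mean(shifted), 2))
```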
Notes: The mode can be used to talk about the most frequent score, but it doesn't tell us anything about the scores that occur around those scores!
Question: How do we pick between the measures of central tendency? When is the mean the best measure, or when is the mode or median the best?
We often look at course evaluations to determine the "best" professor to take. For each of the following three professors, calculate the mean, median and mode of their course evaluations. Looking at the data, interpret the meaning of these measures of central tendency.
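Once you have tried the exercise by hand, you can check your work with the sketch below. It uses the twenty evaluation scores listed for each professor and Python's statistics module; `multimode` is used because a distribution can have more than one mode.

```python
from statistics import mean, median, multimode

evaluations = {
    "Professor Smiley": [5, 5, 5, 5, 4, 4, 4, 4, 4, 4,
                         3, 3, 3, 3, 3, 3, 2, 2, 2, 2],
    "Professor Sweater Vest": [5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
                               4, 4, 4, 4, 4, 4, 4, 4, 3, 3],
    "Professor Shabby Tie": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
                             5, 3, 3, 3, 1, 1, 1, 1, 1, 1],
}

summary = {
    name: (mean(scores), median(scores), multimode(scores))
    for name, scores in evaluations.items()
}

for name, (avg, mid, modes) in summary.items():
    print(name, "mean:", avg, "median:", mid, "mode(s):", modes)
```

Two of the professors end up with the same mean but very different medians and modes, which is a good reason not to rely on a single measure of central tendency.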
Statistics Lectures
Introducing Statistics
Measurement and the Collection of Data
Describing Data and Distributions
Visualizing Quantitative Data
Introduction to Probability Theory
End!
Concept: Measures of Dispersion
(or Measures of Variability)

When we have a distribution of scores, the measures of dispersion (or variability) tell us how the scores are spread around the mean (or another measure of central tendency). In doing so, these measures tell us about the shape of the distribution. Instead of describing the typical score, as we do with a measure of central tendency, we are now concerned with describing the way the scores are spread relative to each other.
Concept: Range
The range is simply the distance between the minimum and the maximum score in a distribution.
Exam 1: 89-80 = 9
Exam 2: 99-72 = 27
Concept: Deviation
The deviation represents the difference between any particular observation (xi) and the mean. For each observation, the deviation tells us the distance from that observation to the mean. It can be positive (if the score is greater than the mean) or negative (if the score is less than the mean).
Concept: Variance
The variance is the average of the squared distances from each score to the mean. On its own, the variance is not a particularly useful statistic, but it is an important step along the way.
To calculate the variance ...
1. Calculate the difference between each score and the mean.
2. Square each difference (or deviation). (Note: Squaring them makes all values positive!)
3. Add up the squared differences.
4. Divide by the number of observations.
1. Difference between each score and the mean
2. Square each difference! (Notice they're all positive!)
3. Add up the squared differences
4. Divide by the number of observations
Concept: Standard Deviation

To calculate the standard deviation, simply take the square root of the variance. The standard deviation is the central statistic that tells us how the scores are spread around the mean in a distribution.
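The four steps for the variance, plus the square root for the standard deviation, can be sketched as follows. The example reuses the twenty midterm scores from earlier because the individual Exam 1 and Exam 2 scores are not listed here. Note that this version divides by the number of observations, as described in the steps; Python's `statistics.pvariance` does the same, while `statistics.variance` divides by N - 1 instead.

```python
from math import sqrt

def variance(values):
    """Average of the squared deviations from the mean (divides by N)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]        # Step 1: score minus the mean
    squared = [d ** 2 for d in deviations]         # Step 2: square each deviation
    sum_of_squares = sum(squared)                  # Step 3: add them up
    return sum_of_squares / n                      # Step 4: divide by N

def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))

midterm_scores = [76, 78, 78, 80, 82, 82, 84, 86, 89, 89,
                  90, 92, 92, 92, 93, 93, 94, 96, 96, 98]

print(variance(midterm_scores))
print(standard_deviation(midterm_scores))
```

Because the standard deviation is back in the original units (exam points), it is much easier to interpret than the variance, which is in squared points.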
Exam 1 - Standard Deviation
Exam 2 - Standard Deviation
The standard deviation is lower for Exam 1 - where the scores were all bunched closer to the mean - than for Exam 2 - where the scores were spread farther away from the mean.
Example: There are two judges, both of whom sentence criminals charged with misdemeanors. While the mean sentence both judges give is the same - 18 months in jail - the standard deviation is very different. One has a very small standard deviation, while the other has a very large standard deviation. What does this mean?
Rules about the standard deviation:

1. Standard deviation is always greater than or equal to zero!

2. Standard deviation is only equal to zero when all the values in a distribution are the same; in other words, when there is no variation in scores!

3. The greater the variability in scores around the mean, the greater the standard deviation.
Concept: Normal Distribution
Concept: Z-Scores
(or standardized scores)

Concept: Skewed Distributions

Why do we call it standardized? Well, you can't compare apples and oranges ...
ACT vs. SAT
Both college entrance exams.
However, one point on the ACT (ACT point) does not equal one point on the SAT (SAT point).
Different units of measurement.
In order to compare them, you need to standardize the measurements.
A student who scored 2 standard deviations above the mean SAT score got a 1,100 on the exam.

A student who scored 2 standard deviations above the mean ACT score got a 24 on the exam.

The student who scored a 1,100 on the SAT did as well, relative to the average grade, as a student who scored a 24 on the ACT. They are both 2 standard deviations above the mean.
Now that we know about the standard deviation, we can begin to think about standardized scores, or Z-scores.

To simplify the concept, you can consider that every score can be represented in two ways - as a raw score or as a standardized score (Z-score).
The raw score is the score in the original units of measurement.
You weigh 140 lbs.
You scored 85 points on the exam.
You have an IQ of 105.
You have $1,250 in your bank account.
You are 72 inches tall.
These scores are all presented in their original units of measurement - pounds, exam points, IQ points, dollars or inches. Note that these scores tell us nothing about how someone scored relative to everyone else in the distribution.
Each of these scores also has a corresponding standardized score, expressed as the number of standard deviations the score falls from the mean score in the distribution.

Your weight is 1.2 sd above the mean.
Your exam score is 1 sd below the mean.
Your IQ is 0 sd from the mean (meaning you have the average IQ score).
Your bank balance is 2 sd below the mean.
Your height is 0.2 sd above the mean.
Note that these scores are all expressed in standard deviation units. We can claim that your height falls much closer to the average height than your bank balance, which is actually pretty far from the average bank balance (even though height is measured in inches and bank balances are measured in dollars)!
Calculating the Z-Score
Example: The average score on the Physics midterm was an 82. Smart kid that you are, you scored a 92. The professor calculated the standard deviation and told you that it was 6. What is your Z-score?
Wait! I've calculated a Z-Score, showing that I scored 1.67 standard deviations above the mean on the Physics exam, but what does that actually mean?
Question: On the next exam, you again score a 92 and the class average is again an 82. However, the standard deviation has changed to 10. What does this mean for the spread of scores on the exam? What does this mean for your score, relative to the other scores?
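A Z-score is the raw score minus the mean, divided by the standard deviation. The sketch below applies that to the two Physics exams described above; the function name is chosen for illustration.

```python
def z_score(raw_score, mean, standard_deviation):
    """Number of standard deviations a raw score falls from the mean."""
    return (raw_score - mean) / standard_deviation

first_exam = z_score(92, 82, 6)     # standard deviation of 6
second_exam = z_score(92, 82, 10)   # same score and mean, standard deviation of 10

print(round(first_exam, 2))   # 1.67
print(second_exam)            # 1.0
```

The raw score and the class average are identical on both exams, but the same ten-point gap is less impressive when scores are more spread out.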
This is the most common curve you will see in statistics. It is called the normal distribution.
About 68 percent of scores fall within 1 sd of the mean!
About 95 percent of scores fall within 2 sd of the mean!
Over 99 percent of scores fall within 3 sd of the mean!
The normal distribution - also known as the bell curve - is a symmetrical curve defined by two statistics - the mean (mean = mode = median) and the standard deviation. In the curve, half of all observations fall above the mean and half fall below the mean. Many social phenomena (e.g., intelligence, height, etc.) approximately follow the normal distribution. As you get farther from the mean score, you will find fewer and fewer observations.
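The rules of thumb for the normal curve can be checked numerically. For a perfectly normal distribution, the share of observations within k standard deviations of the mean equals erf(k / sqrt(2)), where erf is the error function from Python's math module. This is a sketch for the curious; you will not need the formula in this course.

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, "sd:", round(100 * share_within(k), 1), "percent")
```

The exact values are a little different from the rounded rules of thumb (the share within 3 sd is closer to 99.7 percent), but the rounded figures are close enough for quick reasoning.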
The three rules of thumb!
2.10 2.40 2.70 3.00 3.30 3.60 3.90 Raw Score
The GPA Distribution of Students at College X
What is the average GPA?
What is the standard deviation?
For a student with a GPA of 3.45 (raw score), what is her standardized score?
If a student has a GPA 1 standard deviation below the mean, what is his GPA?
What percentage of GPAs fall between 2.70 and 3.30?
95 percent of GPAs fall between what raw scores?
Would it be common to find a student with a GPA of 2.00 or below?
When scores are normally distributed, the right tail of the distribution is the same length as the left tail of the distribution, and the mean=median=mode.

However, we will sometimes find social phenomena that are not normally distributed. This is because some observations have extremely high or extremely low scores, thereby making it so that the mean, median and the mode are not equal to one another.
Example of a Right Skew (or Positive Skew): In the United States, income is not normally distributed because some people make millions of dollars. What happens to the mean, median and the mode when you have some outliers at the top of the distribution?
Example of a Left Skew (or Negative Skew): On a final exam, a handful of students do really poorly, getting extremely low grades relative to everyone else. What happens to the mean, median and the mode when you have some outliers at the bottom of the distribution?
16 Books
37 Books
Between 2008 and 2013, the graphical representation of books increased by 131%. It more than doubled, from 16 to 37.
However, tuition increased by only 16 percent - from $47,908 to $55,640.
Thus, the visual representation is extremely misleading.
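The mismatch is easy to verify with a percent-change calculation. The sketch below uses the book counts and tuition figures quoted above; the function name is hypothetical.

```python
def percent_change(old, new):
    """Change from old to new, expressed as a percentage of the old value."""
    return 100 * (new - old) / old

books_change = percent_change(16, 37)
tuition_change = percent_change(47908, 55640)

print(round(books_change, 1))    # 131.2
print(round(tuition_change, 1))  # 16.1
```

A graphic is honest when the change in the picture matches the change in the data; here the picture grows roughly eight times faster than the numbers it claims to represent.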
30 stick figures
13 stick figures.
11.5 stick figures.
18 stick figures.
The ratio of stick figures from 2013 to 2015 (11.5: 30) makes it look like the yield nearly tripled! In fact, the yield went up by only 15%.
Rules for Displaying
Data Well
Concept: Bar Graphs
Concept: Line Graphs
A bar graph is a visual display of discrete categories (either nominal or ordinal) where the length of each bar represents the percentage or frequency of a category.
Title: The percent of people (age 12 or older) who report using illicit drugs last month, by type of county.
Source: 2010 National Survey of Drug Use and Health
Source: 2010 National Survey of Drug Use and Health
Number of users (age 12 and older) with dependence or abuse, by drug type.
Concept: Histograms
A histogram is a visual display for continuous data (interval/ratio) where the scores are presented along one axis and the frequency (or percentage) of that score is presented along the other axis. Often, continuous data are recoded into categories before the construction of a histogram (e.g., a continuous GPA may be recoded into intervals of 0.10). Histograms are often used to show the distribution of continuous data.
Average annual count of evicted tenants, by gender and neighborhood racial composition
Source: Desmond 2012
How could you improve
the quality of this graph?
Predicted Probability of Trusting Various Social Groups, by Homeownership Status
Source: McCabe 2012
A line graph is a visual display of data typically used to track a social phenomenon across time, or some other continuous measure.
Concept: Pie Graphs
Pie Charts are good for making representations of Pac Man, but aren't particularly good for displaying statistical information. The reason is two-fold. First, and most importantly, pie charts can tell us about the relative size of each category but, unlike bar charts or histograms, tell us nothing about their frequency. Second, it is often difficult to correctly visualize the relative size of a piece of the pie.
Which Type of Visual
Tool Should I Use?
6. I want to show the distribution of GPAs for the students at Georgetown.
Using Statistical Tools to Describe and Visualize Quantitative Data
Open and describe the data (e.g., the number of variables, observations, missing values, etc.).
Sort the data according to particular variables in the data.
Recode continuous measures into discrete measures (e.g., continuous age measure into categorical age measure).
Get basic descriptive statistics (e.g., measures of central tendency, measures of dispersion).
Create data visualizations (bar charts, histograms, line graphs, etc.)
How do you play roulette (and why does that woman look like she's having so much fun)?
Concept: Probability
Probability refers to the likelihood that a particular outcome will occur over a long sequence of observations. It is equal to the proportion of times we expect a particular event over a large number of trials.
Notation: P(A) refers to the probability of event "A" occurring. For example, in the flip of a fair coin, P(Head) = 0.50. In a class of twenty-five students where ten of the students are sophomores, the probability of picking a sophomore when randomly selecting a student = P(sophomore) = 0.40.
Concept: Probability Rules (or Probability Rules!)

1. The probability of an event occurring equals the number of successful outcomes divided by the total number of possible outcomes.

P(A) = Number of Successful Outcomes/Number of Total Outcomes
2. The probability of an event occurring always ranges between 0 and 1.
3. Converse Rule: The probability of an event not occurring is equal to 1 minus the probability of that event occurring.

P(not A) = 1-P(A)
4. Addition Rule: If A and B are distinct outcomes with no overlap, then the probability of either getting A or B is equal to just adding up the probability of both outcomes.

P(A or B) = P(A) + P(B)
5. Multiplication Rule: The probability of getting a combination of independent events is equal to the product of the probabilities of their separate occurrences.

If A and B are independent events, then P(A and B) = P(A) * P(B).
4a. Adjusting for Joint Occurrence: If two outcomes overlap, simply adding their probabilities double-counts the cases where both occur, so we have to make a correction. In this case, we simply subtract out the joint occurrences: P(A or B) = P(A) + P(B) - P(A and B).
5a. Conditional Probability: If A and B are both possible outcomes, then P(A and B) = P(A) * P(B given A)
If I randomly select one student in this class, what is the probability he or she will have a Georgetown ID?
If I randomly select one student in this class, what is the probability he or she will already have a bachelor's degree?
P(King): The probability of selecting a King from a deck of cards is 0.0769.

P(Not King): The probability of not selecting a King from a deck of cards is 0.9231.
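Each of the probability rules can be tried out on a standard 52-card deck. The sketch below is an illustration only: the specific events (kings, queens, hearts, coin flips, drawing without replacement) are examples chosen here, and the variable names are made up.

```python
from fractions import Fraction

# Rule 1: successful outcomes divided by total outcomes (4 kings in 52 cards).
p_king = Fraction(4, 52)

# Rule 3 (converse rule): probability of the event not occurring.
p_not_king = 1 - p_king

# Rule 4 (addition rule): kings and queens do not overlap, so just add.
p_king_or_queen = Fraction(4, 52) + Fraction(4, 52)

# Rule 4a (joint occurrence): the king of hearts is both a king and a heart,
# so subtract it once to avoid double-counting.
p_king_or_heart = Fraction(4, 52) + Fraction(13, 52) - Fraction(1, 52)

# Rule 5 (multiplication rule): two coin flips are independent events.
p_two_heads = Fraction(1, 2) * Fraction(1, 2)

# Rule 5a (conditional probability): two kings in a row without replacement.
# After one king is drawn, 3 kings remain among 51 cards.
p_two_kings = Fraction(4, 52) * Fraction(3, 51)

print(round(float(p_king), 4), round(float(p_not_king), 4))
print(p_king_or_queen, p_king_or_heart, p_two_heads, p_two_kings)
```

Using fractions instead of decimals keeps the answers exact, which makes it easier to see where each number comes from.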
Concept: Probability Distribution
Concept: Random Variable
Concept: Probability and the Normal Curve

The rate is the frequency of an occurrence, relative to a base number (measured in the 10's, 100's or 1000's, etc.).
Seven cities with the highest murder rate (2010)
(Note: the murder rate is the number of murders per 100,000 people)

1. New Orleans - 49.1
2. St. Louis - 40.5
3. Baltimore - 34.8
4. Detroit - 34.5
5. Newark - 32.1
6. Oakland - 22.0
7. Washington, DC - 21.9
Can we measure grit?
Concept: Unit of Measurement
What is the unit you're measuring?

For income, you're measuring in dollars.

For standardized test scores, you're measuring in points.

For education, you're measuring in years of schooling.
Nominal
Ordinal
Interval
Ratio
(Special case: Dichotomous/Dummy [yes/no])
Note: It is possible to make continuous variables into discrete categories. For example, you could have a continuous age variable (e.g., 18, 19, 20, 21, 22, 23, 24 etc.) and recode it into a categorical variable (e.g., 18-21, 22-30, 31-40, etc.)
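Recoding a continuous variable into categories can be sketched as a simple function. The age brackets below follow the example in the note; the "Under 18" and "41 and older" categories are assumptions added here so that the response options are exhaustive.

```python
def recode_age(age):
    """Recode a continuous age into one (and only one) age category."""
    if age < 18:
        return "Under 18"       # assumed category, not in the original example
    if age <= 21:
        return "18-21"
    if age <= 30:
        return "22-30"
    if age <= 40:
        return "31-40"
    return "41 and older"       # assumed category, not in the original example

ages = [18, 19, 20, 21, 22, 23, 24, 35, 52]
recoded = [recode_age(age) for age in ages]
print(recoded)
```

Because every age lands in exactly one bracket, the recoded variable is both exhaustive and mutually exclusive. Keep in mind that recoding throws away information: you can go from ages to brackets, but not back again.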
Person A: Weighs 150 lbs!
Weighs himself five times and gets the following scores ...
130
145
150
120
160
Weighs himself five times and gets the following scores ...
125
126
125
126
124
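One way to compare the two sets of weigh-ins is to look at how consistent the readings are (their spread) and how far they fall, on average, from the true weight of 150 lbs. Treating the two lists as readings from two different scales is an assumption made for this sketch, as are the variable names.

```python
from statistics import mean, pstdev

true_weight = 150

scale_1 = [130, 145, 150, 120, 160]   # first set of five readings
scale_2 = [125, 126, 125, 126, 124]   # second set of five readings

for name, readings in (("Scale 1", scale_1), ("Scale 2", scale_2)):
    average = mean(readings)
    spread = pstdev(readings)              # small spread = consistent readings
    average_error = average - true_weight  # distance from the true weight
    print(name, "mean:", average, "sd:", round(spread, 2), "error:", round(average_error, 1))
```

The second set of readings is far more consistent, so it is more reliable, yet it is consistently about 25 lbs too low. A measure can be reliable without being accurate.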
What makes the Census question about race exhaustive?
Do you think the Census question about race is mutually exclusive?
Measurement is imperfect.

Sometimes the instruments are imperfect (e.g., the scale is slightly off). Sometimes people are misleading about their responses (e.g., they might under-report their weight, or over-report their voting behavior).

While we work to minimize measurement error, there is the possibility for error in all measurement.
1. Imprecise tools.

2. Poorly worded questions, surveys.

3. Interviewer biases.

4. Respondent biases (e.g., social desirability).

5. Coding/processing errors.
Using the last names of students in this class, construct a frequency table.
Final Activity:
How do we measure poverty
in the United States?
We often hear accounts of poverty, or the percentage of Americans who live in poverty, but what are the actual indicators used to measure poverty?
"X bar" is the mean!
"Sigma" is to sum everything
or to add them all up!
"X i" (or X subscript i) is the ith
observation in a dataset
"N" is the total number of
observations in that dataset
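Putting those four symbols together, the formula for the mean can be written as follows (a reconstruction from the labels above, since the formula itself is not in the transcript):

```latex
\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}
```

In words: add up every observation, then divide by the number of observations.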
10 15
76 78
78 80 82 82 84 86 89
89 90 92 92 92 93 93 94 96 96 98
Outliers!
Outliers are scores that are markedly different from the rest of the scores in the distribution. They distort calculations of central tendency, like the mean.
Looking back at these scores ... In which distribution do the scores typically fall closest to the mean?
Example: Home Prices in DC
To calculate the variance
1. Subtract each observation
from the mean (to get the deviation)
2. Square the deviation
3. Add up all of the squared
deviations ("Sum of Squares")
4. Divide the Sum of Squares
by the number of observations
to get the variance.
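The four steps can be written as one formula. This is a sketch that follows the steps as listed, dividing by N; many textbooks divide by N - 1 when the data are a sample rather than the whole population. The symbol \sigma for the standard deviation is an assumption, since the transcript does not show the notation used in class.

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}
\qquad
\sigma = \sqrt{\sigma^2}
```

The standard deviation is simply the square root of the variance, which puts it back in the original unit of measurement.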
(Note: These are two normal curves.
The peak of each curve is the mean.
When the scores are bunched closer
to the mean, the standard deviation
is small; when the scores are spread
wider from the mean, the standard
deviation is larger. We will learn
about the normal curve soon.)
Two students in two separate Introduction to Sociology courses both score a 90 on their exams. Are all 90s created equal?

If a person in class A got a 90, but the mean was 95, he did below average; if a person in class B got a 90, but the mean was 75, she did well above the average.

Even though the raw scores are the same (because 90=90), the students scored very differently
relative to the rest of the students in their class!
Example: Calculate
the measures of dispersion.

There are six jobless households. They have
received unemployment benefits for the following number of weeks:
9 8 6 4 2 1
On Top Hat, calculate the mean, the range, and the standard deviation for the distribution.
Mean = 5
Range = 8
Standard Deviation = 2.94
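To check these answers, here is a short Python sketch that follows the four variance steps listed earlier, dividing by the number of observations. The variable names are only illustrative.

```python
import math

weeks = [9, 8, 6, 4, 2, 1]  # weeks of unemployment benefits

mean = sum(weeks) / len(weeks)
value_range = max(weeks) - min(weeks)

# Step 1: deviation of each observation from the mean
# (the sign does not matter once we square it).
deviations = [x - mean for x in weeks]
# Steps 2 and 3: square each deviation and add them up ("Sum of Squares").
sum_of_squares = sum(d ** 2 for d in deviations)
# Step 4: divide by the number of observations to get the variance.
variance = sum_of_squares / len(weeks)
std_dev = math.sqrt(variance)

print(mean, value_range, round(std_dev, 2))  # 5.0 8 2.94
```

The sum of squares is 52, so the variance is 52 / 6 = 8.67 and the standard deviation is its square root.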
The standardized score
for a value of X
The difference between
X and the mean of X (X-bar)
The standard deviation
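The three labels above describe the pieces of the z-score formula. A reconstruction, assuming s stands for the standard deviation (the transcript does not show the symbol used on the slide):

```latex
z = \frac{X - \bar{X}}{s}
```

A z-score tells us how many standard deviations a raw score sits above (positive) or below (negative) the mean.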
Things to remember about
the normal curve ...
1. The entire area under the curve
is always equal to 100 percent!
2. The peak of the normal curve is
the mean of the distribution. In a
normal distribution, the mean = mode =
median.
3. Half of all observations fall above the mean.
Half of all observations fall below the mean.
4. Nearly all scores - more than 99 percent of them! - fall within three standard deviations of the mean
(+/- 3 sd). It is very unlikely to find an observation with a score more than 3 sd from the mean.
5. About 95 percent of scores fall within 2 sd of the mean. About 68 percent of scores fall within 1 sd of the mean.
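The 68, 95 and 99 percent figures can be checked directly. This is a minimal Python sketch using the error function from the standard `math` module; the function name `area_within` is hypothetical.

```python
import math

def area_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(area_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

These are the exact areas behind the rounded numbers we use in class.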
What percentage of scores fall between 0 and 1 standard deviation above the mean?
What percentage of scores fall between 0 and 2 standard deviations below the mean?
What percentage of scores fall between 0 and 3 standard deviations above the mean?
Normal Distribution Table

When you calculate a z-score, it often won't be a nice, even number (e.g., 1, 2 or 3). You may get a z-score of 1.45, or a z-score of -2.05.
In the Normal Distribution Table, you will find z-scores and the corresponding area under the normal curve for each z-score.
Top Hat:
On the SATs, Person A scores 1.5 standard deviations above the mean. Assuming SAT scores are normally distributed, what percentage of observations fall at or below her score?
Top Hat:
A parent brings her child to get measured (height) and weighed. The child's height is 0.20 standard deviations below the mean. What percentage of children fall within 0.20 standard deviations on either side of the mean?
Top Hat:
The average strawberry weighs 8.5 ounces, with a standard deviation of 1 oz. A worker finds a strawberry that weighs 11 ounces! After calculating the z-score, what percentage of strawberries would weigh more than this one?
Scores
Frequency (# of times it occurs)
Here's what you should be able to do by now ...
- Distinguish between levels of measurement (e.g., nominal, ordinal, interval, etc.).
- Talk about the challenges of measurement (e.g., reliability, validity, mutual exclusivity, etc.) in the social sciences.
- Calculate rates, ratios, percentages and proportions.
- Make a frequency table, including recoding continuous variables into discrete measures.
- Calculate the measures of central tendency (mean, median and mode) and talk about their limitations.
- Calculate the measures of dispersion (range, deviation, standard deviation).
- Explain the standard deviation.
- Calculate a z-score and explain the difference between a z-score and a raw score.
- Use the normal distribution table in the back of your book when calculating z-scores.
Top Hat:
We are studying the distribution of annual income for cashiers at fast food chains. The mean income is $14,000 and the standard deviation is $1,500. Assuming a normal distribution, what percentage of cashiers earn between $14,000 and $16,000?
1. Visualizing Data Badly

2. Creating Good Data Visualizations
(Bar Charts, Line Graphs, Histograms)

3. Crash course on statistical programs
(e.g., Excel, Minitab)
1. Open & Describe the Data
2. Sort the Data
3. Recode Variables
4. Descriptive Statistics
5. Data Visualizations
Opening Data:
FILE --> Open Project

Basic Descriptives:
STAT --> Basic Statistics -->
Display Descriptive Statistics
Sorting Data:
DATA --> Sort
Recoding Data:
DATA --> Code
Descriptive Statistics:
CALC --> Column Statistics

Z-Scores:
CALC --> Standardize
Data Visualizations:
GRAPH --> Bar Chart
GRAPH --> Line Chart (or Time Series Plot)
GRAPH --> Histogram
1. Axis scales can be misleading!
Don't manipulate the scale in a way
that makes people read the figure differently.
2. The visual images should be
correctly sized to the data. Otherwise
we end up with perceptual distortion
(where the image tells a story different from the actual data).
3. Consistent scales & consistent axes!
(Units across the axes must be the same.)
4. Enough information, but not too much information. (Avoid data junk.)
1. Select the appropriate type of graph.

2. Clearly label your axes.

3. Ensure consistent scales on the axes. Include a legend (where appropriate).

4. Write titles that identify the information in the chart. (We should be able to "read" the chart without any accompanying text).

5. Avoid perceptual distortions.

6. Minimize data "junk". (This includes excess colors, symbols, and information not directly related to the data story itself.)

7. Remember: Data visualizations are being used to tell a story!
Explain Each Visual Display of Data ...
1. For every year since 1980, I want to graph the acceptance rate at Georgetown.

2. I want to plot the number of murders committed in each of the four quadrants (NW, NE, SW, SE) in DC last year.
3. I would like to show the distribution of SAT scores for students from public and private schools.
4. I want to show voter participation rates for different age groups.
5. I would like to compare the incarceration rates in each state in the United States.
Reading Social Research
Understanding Statistical Tools in Contemporary Research
Murray et al. 1990. Teacher Personality Traits and Student Instructional Ratings in Six Types of University Courses. Journal of Educational Psychology.
McAdam, Doug and Cynthia Brandt. 2009. Assessing the Effectiveness of Voluntary Youth Service: The Case of Teach for America. Social Forces
From a deck of cards (52), the probability that
I will randomly select a King = P(King) = 4/52 = 0.0769
P(King or Queen) = P(King) + P(Queen) = 4/52 + 4/52 = 0.0769 + 0.0769 = 0.1538.
P(King or Heart ) = P(King) + P(Heart) - P(Joint Occurrence) = 4/52 + 13/52 - 1/52 = 16/52 = 0.3077.
P(King and Tails) for a card draw and a separate coin flip: P(King) * P(Tails) = 4/52 * 1/2 = 4/104 = 0.0385
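The card examples can be checked with exact fractions. This is a sketch in Python using the standard `fractions` module; the variable names are only illustrative.

```python
from fractions import Fraction

DECK_SIZE = 52

p_king = Fraction(4, DECK_SIZE)
p_queen = Fraction(4, DECK_SIZE)
p_heart = Fraction(13, DECK_SIZE)
p_king_of_hearts = Fraction(1, DECK_SIZE)  # the one card that is both a King and a Heart
p_tails = Fraction(1, 2)

p_not_king = 1 - p_king                                # complement
p_king_or_queen = p_king + p_queen                     # addition rule (cannot both happen)
p_king_or_heart = p_king + p_heart - p_king_of_hearts  # subtract the joint occurrence
p_king_and_tails = p_king * p_tails                    # multiplication rule (independent)

results = {
    "P(Not King)": p_not_king,
    "P(King or Queen)": p_king_or_queen,
    "P(King or Heart)": p_king_or_heart,
    "P(King and Tails)": p_king_and_tails,
}
for label, p in results.items():
    print(f"{label} = {p} = {float(p):.4f}")
```

Working in fractions avoids rounding until the last step, which makes decimal slips easy to spot.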
with Replacement
vs.
without Replacement
Question: What is the probability of selecting two Kings in a row? Here, replacement matters!
Do you put the card back in the deck before selecting the second card, or do you select from the remaining 51 cards?

P(Ace and Ace) with replacement?
P(Ace and Ace) without replacement?
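Here is a sketch of how the two answers differ, again with exact fractions in Python. The same numbers apply to two Kings or two Aces, since there are four of each in the deck.

```python
from fractions import Fraction

# With replacement: the deck is back to 52 cards, 4 of them Aces.
p_with = Fraction(4, 52) * Fraction(4, 52)

# Without replacement: 51 cards remain, and only 3 of them are Aces.
p_without = Fraction(4, 52) * Fraction(3, 51)

print(p_with, round(float(p_with), 4))        # 1/169 0.0059
print(p_without, round(float(p_without), 4))  # 1/221 0.0045
```

Without replacement, the second draw is a conditional probability: P(Ace and Ace) = P(Ace) * P(Ace given Ace).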
In randomized events (e.g., child birth, flipping coins, etc.), the probability distribution describes the likelihood of each possible outcome of the event. The probability distribution is analogous to the frequency distribution, except that it is based on the number of expected occurrences in the long run (based on probability theory) rather than the actual number of occurrences (as described by empirical evidence).
The normal curve - which you have already seen - is basically an ideal or theoretical model showing the probability that particular events will occur. When we calculated z-scores and looked at the area under the curve, we were figuring out how likely it was that particular events would occur.
This is an application of the rules of probability.
Percentile Rankings: We can also use the area under the normal curve to think about percentiles, knowing the
percentage of the population that falls at or below
a certain score.
Question: How likely is a score to fall +/- 1 standard deviation from the mean?

P (+/-1sd) = P(-1<z<0) + P(0<z<1) = .3413 + .3413 = .6826
Question: What is the percentile ranking for someone with a score 1 standard deviation above the mean?

P(z<1) = P(z<0) + P(0<z<1) = .5000 + .3413 = .8413, or the 84th percentile: 84 percent of scores fall at or below this score.
Question: How likely is a score to fall between 2 and 2.5 standard deviations above the mean?

P(2<z<2.5) = P(0<z<2.5) - P(0<z<2) = 0.4938-0.4772 = .0166
Question: If you scored 1.4 standard deviations above the mean, what is your percentile ranking?

P(z<1.4) = 0.5000 + 0.4192 = .9192

Percentile ranking is the 92nd percentile. You scored at or above 92 percent of people.
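The four worked answers above can be checked without the table. This is a sketch using `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the variable names are only illustrative. Because the table rounds to four decimals, results can differ from the table-based answers in the last digit.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

within_one_sd = z.cdf(1) - z.cdf(-1)       # table answer: .6826
percentile_at_1 = z.cdf(1)                 # table answer: .8413
between_2_and_2_5 = z.cdf(2.5) - z.cdf(2)  # table answer: .0166
percentile_at_1_4 = z.cdf(1.4)             # table answer: .9192

for value in (within_one_sd, percentile_at_1, between_2_and_2_5, percentile_at_1_4):
    print(round(value, 4))
```

`cdf(x)` returns the area under the curve at or below x, which is exactly the percentile idea.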
If I flip a coin twenty times and it lands on heads 12 times (and tails 8 times) ....

         Probability    Relative Frequency

Heads    0.50           0.60

Tails    0.50           0.40
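To see the difference between the expected probability and the observed frequency, we can simulate coin flips. This is a minimal sketch in Python; the function name `heads_share` and the fixed seed are assumptions made so the run is repeatable.

```python
import random

def heads_share(n_flips, seed=0):
    """Flip a fair coin n_flips times and return the share that land heads."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

for n in (20, 200, 20_000):
    print(n, heads_share(n))
```

With only 20 flips the observed share can be well off 0.50; with many flips it settles close to the probability.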
Your turn!

Questions?
Comments?
Confusions?
Thoughts?

To observe patterns or relationships among variables

1. What is the relationship between how high students rate a course and how easy they think it is?

2. What is the relationship between the number of crimes committed in a neighborhood and the number of bars and restaurants?

3. What is the relationship between whether or not a basketball player made his last free-throw and whether or not he makes his next one?

4. What is the relationship between your parents' income and your future income? (Do you think this varies depending on your race, where you grew up, or the year you were born?)
To predict future events ...

1. Can we use statistics to quantify the likelihood of a political candidate winning, or a basketball team winning?

2. Can we use statistics to quantify the likelihood that it will be sunny, or that it will snow?
Comparison: Bar Chart vs. Pie Chart

Concept: Odds
The odds refer to the probability that an event will occur relative to the probability that an event will not occur.
If there are twenty women in the class and ten men,
the odds of randomly selecting a woman are 2:1.

If I randomly pick a day of the week, the odds that I do not pick Tuesday are 6:1.
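Converting a probability into odds can be sketched in Python as follows. The function name `odds` is hypothetical, and it returns the odds in favor of the event.

```python
from fractions import Fraction

def odds(p):
    """Odds in favor: P(event occurs) divided by P(event does not occur)."""
    p = Fraction(p)
    if not 0 <= p < 1:
        raise ValueError("p must be at least 0 and less than 1")
    return p / (1 - p)

woman = odds(Fraction(20, 30))      # 20 women out of 30 students
not_tuesday = odds(Fraction(6, 7))  # 6 of the 7 days are not Tuesday

print(f"{woman.numerator}:{woman.denominator}")              # 2:1
print(f"{not_tuesday.numerator}:{not_tuesday.denominator}")  # 6:1
```

Notice that the probability of selecting a woman is 2/3, while the odds are 2:1. They describe the same situation in different ways.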
Question 1a: What is the probability that the ball will land on black?
Question 1b: What are the odds that the ball will land on black?

Question 2a: What is the probability that the ball will land on 21?
Question 2b: What are the odds that the ball will land on 21?

Question 3a: What is the probability that the ball will land on 28 or 31?
Question 3b: What are the odds that the ball will land on 28 or 31?
Question: Sylvia has an IQ of 95. Herman has an IQ of 101. Assuming that IQ scores are normally distributed, with a mean of 100 and a standard deviation of 15, what percentage of the population has an IQ score between Sylvia and Herman?

P(-0.333<z<0.067) = P(-0.333<z<0) + P(0<z<0.067) = .1293 + .0279 = .1572
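The same answer can be reached by first converting each IQ score to a z-score and then taking the area between them. This is a sketch using Python's standard `statistics.NormalDist`; the variable names are only illustrative, and the result can differ from the table-based answer in the fourth decimal because the table rounds z to two places.

```python
from statistics import NormalDist

MEAN_IQ = 100
SD_IQ = 15

# Step 1: convert each raw score to a z-score.
z_sylvia = (95 - MEAN_IQ) / SD_IQ   # about -0.333
z_herman = (101 - MEAN_IQ) / SD_IQ  # about 0.067

# Step 2: area under the standard normal curve between the two z-scores.
standard_normal = NormalDist()
share_between = standard_normal.cdf(z_herman) - standard_normal.cdf(z_sylvia)

print(round(z_sylvia, 3), round(z_herman, 3), round(share_between, 3))
```

About 16 percent of the population has an IQ between Sylvia's and Herman's.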
Top Hat:
How do we measure race?
What about the idea of the "middle class"?

What types of characteristics would I look
for to determine whether someone was
in the middle class?
Back to the middle class example ...

What is a measure of middle class status
that accurately gets at the concept we're
trying to measure?
Top Hat: Class Survey

Top Hat: Group Project Assignments

You will find a list of fifteen potential
research topics and an ordered ranking
(1-15) to match. Read through the list, and
rank the topics that interest you (with
1 being the most interesting and 15 being
the least interesting). These rankings will
be the basis of your group project assignments.
Top Hat Review: Percentiles

Top Hat Review: Outliers
Top Hat Question: Describe the
distribution of wealth in the
United States
Top Hat:
Square Root
Group Projects:
Discussion
Comments
Thoughts
What are some of the pitfalls that Wainer
(in Displaying Data Badly) gives in his exposition
of the rules for displaying quantitative data?
Note: Sociological research on the
likelihood of having a third child
if your first two children are of the
same sex.

Do you think parents are more likely or
less likely (or that the likelihood is
unchanged) to have a third child when
their first two children are of the
same sex?