Transcript of Statistics
Representatives who voted for the Civil Rights Act of 1964
House      Democrat        Republican
Northern   94% (145/154)   85% (138/162)
Southern    7% (7/94)       0% (0/10)
Both       61% (152/248)   80% (138/172)
Why are Republicans overall more supportive when the Democrats in each region were more supportive?
The Republicans were more concentrated in the North, and Northern representatives were more supportive of the bill.
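This reversal is usually called Simpson's paradox. A minimal Python sketch, using only the counts from the table above (the variable names are just for illustration), shows how the regional and overall rates come apart:

```python
# (yes votes, total members) per party and region, taken from the table above
votes = {
    "Democrat":   {"Northern": (145, 154), "Southern": (7, 94)},
    "Republican": {"Northern": (138, 162), "Southern": (0, 10)},
}

def rate(yes, total):
    return yes / total

overall = {}
for party, regions in votes.items():
    for region, (yes, total) in regions.items():
        print(f"{party:10s} {region:8s} {rate(yes, total):.0%}")
    yes_all = sum(y for y, _ in regions.values())
    total_all = sum(t for _, t in regions.values())
    overall[party] = rate(yes_all, total_all)
    print(f"{party:10s} {'Both':8s} {overall[party]:.0%}")

# Share of each party's members who come from the low-support South
south_share = {
    party: regions["Southern"][1] / sum(t for _, t in regions.values())
    for party, regions in votes.items()
}
print(south_share)
```

About 38% of the Democrats but only about 6% of the Republicans represented the South, so the low Southern rate drags the Democrats' overall figure down much more.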
Fences of Boxplots
Upper fence = Q3 + 1.5(IQR)
Lower fence = Q1 - 1.5(IQR)
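A rough Python sketch of the fence formulas, for checking outliers by hand. The data are made up, and the quartile rule used here (medians of the lower and upper halves) is an assumption; other software may compute quartiles slightly differently.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves
    (the middle value is left out when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])

def fences(values):
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 5, 7, 8, 9, 10, 12, 13, 15, 40]  # hypothetical data
lower, upper = fences(data)
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)
```

Any value beyond a fence gets plotted as an outlier on the boxplot; here only 40 falls outside.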
For 5 Number Summary:
STAT CALC --> 1-VAR STATS --> ENTER
1-VAR STATS L1 --> ENTER
Where "mu" stands for the mean and "sigma" stands for the standard deviation
Shifting and Rescaling Data
2nd DISTR --> normalcdf( --> ENTER
normalcdf(low z-score, high z-score) --> ENTER = area under the Normal model between the two z-scores
2nd DISTR --> invNorm( --> ENTER
invNorm(percentile as a decimal) --> ENTER = z-score cut point for that percentile
A score at or above the 90th percentile is how many standard deviations from the mean?
Answer: 1.28 SDs or more above the mean
Test scores are normally distributed with a mean of 72 and a standard deviation of 4.2. What is the percentile of a score of 65 (rounded to the nearest whole percentile)?
Answer: 5th percentile
Given that 10% of the nails made by a manufacturer have a length less than 2.48 in., while 5% have a length greater than 2.54 in., what are the mean and standard deviation of the lengths of the nails? Assume that the lengths have a normal distribution.
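The three questions above can also be checked off the calculator. This is a sketch using Python's statistics.NormalDist in place of normalcdf and invNorm; the helper name normalcdf is borrowed from the calculator and is not a Python built-in.

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal model: mean 0, SD 1

def normalcdf(low, high):
    """Area under the standard Normal model between two z-scores."""
    return z.cdf(high) - z.cdf(low)

# 90th percentile cut point, like invNorm(0.90)
z90 = z.inv_cdf(0.90)

# Percentile of a score of 65 when scores follow N(72, 4.2)
pct_65 = NormalDist(mu=72, sigma=4.2).cdf(65)

# Nails: 10% are shorter than 2.48 in., 5% are longer than 2.54 in.
z_low = z.inv_cdf(0.10)    # z-score with 10% of the area below it
z_high = z.inv_cdf(0.95)   # z-score with 5% of the area above it
sigma = (2.54 - 2.48) / (z_high - z_low)   # the two cut points are (z_high - z_low) SDs apart
mu = 2.48 - z_low * sigma

print(f"z90 = {z90:.2f}, percentile of 65 = {pct_65:.1%}")
print(f"nails: mean = {mu:.3f} in., SD = {sigma:.4f} in.")
```

For the nails, this works out to a mean of about 2.506 in. and a standard deviation of about 0.0205 in.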
Quantitative variables condition
Straight enough condition
Sign of correlation coefficient gives direction of association
Correlation of x with y is same as correlation of y with x
Correlation has no units
Correlation not affected by changes in center or scale of either variable
Correlation is sensitive to outliers
Correlation does not prove causation!
Time for Mr. Whiskers!
Residual: Difference between observed value and predicted value
Line of best fit: line for which sum of squared residuals is smallest
Regression equation: y-hat = a + bx
Slope of regression line: b = r(sy/sx)
Slope of best fit line for z-scores is the correlation, r
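A small Python sketch of the slope formula with made-up data. The intercept formula a = y-bar - b(x-bar) is not on the slide; it is the standard companion result, since the line of best fit passes through the point (x-bar, y-bar).

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]               # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)

# Correlation: sum of the products of z-scores, divided by n - 1
r = sum(((xi - x_bar) / sx) * ((yi - y_bar) / sy) for xi, yi in zip(x, y)) / (n - 1)

b = r * sy / sx          # slope of the regression line
a = y_bar - b * x_bar    # intercept (assumed standard formula, see note above)

predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed minus predicted

print(f"r = {r:.4f}, y-hat = {a:.3f} + {b:.3f}x")
```

The residuals from a least squares line always add up to zero (apart from rounding), which is a handy check.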
Types of Variables
Mean is pulled toward the tail
Center and Percentiles
5 Number Summary
Skewed: Median & IQR
Symmetric: Mean & SD
Normal Model: N(mu, sigma)
We know our variables. What's their relationship?
r^2 = the percentage of variability in y that is explained by x
STAT CALC -> LinReg (a+bx) -> ENTER
See equation, r, r^2
For residuals plot:
Go to STATPLOT -> x list: L1, y list: RESID -> Turn plot on, zoom stat
Don't fit a straight line to a non-linear relationship
Woah woah woah, hold up! Before all of this, we need the data!
Where it all begins...
1) Assign digits
2) Define trial
3) Define success
4) Run simulation
5) Analyze results
6) State conclusion
To seed (ex. seed 5): enter 5, press STO->, then MATH, then PRB, then rand, hit ENTER
For random digits: MATH, PRB, randInt(0, 9, 5), ENTER (gives five random integers from 0 to 9)
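The six steps can be sketched in Python too. The scenario here is hypothetical: each attempt succeeds 70% of the time, and we want the chance of at least one success in three attempts.

```python
import random

random.seed(5)  # seeding makes the run repeatable, like seeding the calculator

def run_trial():
    # 1) Assign digits: 0-6 mean a success (70%), 7-9 mean a failure
    # 2) Define trial: three random digits, one per attempt
    digits = [random.randint(0, 9) for _ in range(3)]
    # 3) Define success: at least one of the three attempts succeeds
    return any(d <= 6 for d in digits)

# 4) Run simulation
trials = 100_000
successes = sum(run_trial() for _ in range(trials))

# 5) Analyze results
estimate = successes / trials

# 6) State conclusion
print(f"Estimated probability of at least one success in 3 attempts: {estimate:.3f}")
```

With this many trials the estimate should land close to the exact value, 1 - 0.3^3 = 0.973.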
Addition Rule: P(A ∪ B) = P(A) + P(B), A and B must be disjoint
Multiplication Rule: P(A ∩ B) = P(A) * P(B), A and B must be independent
A certain bowler can bowl a strike 70% of the time. What is the probability that she:
a) goes 3 consecutive frames without a strike
b) makes her first strike in the 3rd frame?
c) has at least one strike in the first 3 frames?
d) bowls a perfect game (12 consecutive strikes)?
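A quick Python check of the four parts, assuming the frames are independent (the problem does not say so outright, but the multiplication rule needs it):

```python
p = 0.70      # probability of a strike in any one frame
q = 1 - p     # probability of no strike

# Treating frames as independent lets us multiply probabilities
a = q ** 3            # a) no strike in 3 consecutive frames
b = q ** 2 * p        # b) first strike comes in the 3rd frame
c = 1 - q ** 3        # c) at least one strike in the first 3 frames
d = p ** 12           # d) perfect game: 12 strikes in a row

for label, value in zip("abcd", (a, b, c, d)):
    print(f"{label}) {value:.4f}")
```

Part c) uses the complement: "at least one strike" is everything except "no strikes at all".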
Now to Probability Rules
General Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Computer Regression Analysis
Make Distribution more symmetric
Make the spread of several groups more alike
Make the form of a scatterplot more linear
The Power of Re-Expression
Ladder of Powers
"0" or log
Randomize to protect against effects we aren't aware of
Types of Sampling Designs
Simple Random Sample
Calling at 2 PM
P(B|A) = P(A ∩ B) / P(A)
If independent, P(B|A) = P(B)
We can use the various sampling methods in an experiment.
Well-designed experiments can show a cause-and-effect relationship between two variables.
4 Principles of Experimental Design
4) Block: reduce variability due to differences among blocks
Those lurking variables...
Common response: both the explanatory and response variables are influenced by a third variable
Confounding: when two variables are entwined in such a way that you can't tell which one is affecting the response variable
P(A ∩ B) = P(A) x P(B|A)
General Multiplication Rule
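A tiny Python illustration of the rule with a made-up example (drawing two cards without replacement, so the draws are not independent):

```python
from fractions import Fraction

# Hypothetical example: draw two cards from a standard 52-card deck without replacement
p_a = Fraction(4, 52)           # P(A): first card is an ace
p_b_given_a = Fraction(3, 51)   # P(B|A): second card is an ace, given the first was
p_a_and_b = p_a * p_b_given_a   # P(A and B) = P(A) x P(B|A)

print(p_a_and_b)  # 1/221
```

Dividing the result by P(A) gives P(B|A) back, which is just the conditional probability formula rearranged.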
Answer: E, use P(California | Phone)
Okay, I know how to find my probabilities. What's next?
Building a probability model to calculate expected values and standard deviations!
Expected Value: E(X) = μ = Σ x * P(X = x)
Example: E(X) = 10,000(1/1000) + 5000(2/1000) + 0(997/1000)
Variance: VAR(X) = (SD(X))^2 = Σ (x - μ)^2 * P(X = x)
SD(X) = sqrt(VAR (X))
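The example model can be run through these formulas in a few lines of Python. This is only a sketch; the dictionary layout is one convenient way to hold the values and their probabilities.

```python
from math import sqrt

# Probability model from the example: value x and its probability P(X = x)
model = {10_000: 1 / 1000, 5_000: 2 / 1000, 0: 997 / 1000}

expected = sum(x * p for x, p in model.items())
variance = sum((x - expected) ** 2 * p for x, p in model.items())
sd = sqrt(variance)

print(f"E(X) = {expected:.2f}, VAR(X) = {variance:.2f}, SD(X) = {sd:.2f}")
```

The expected value is 20, and the standard deviation comes out to roughly 387, far larger than the mean because almost all of the probability sits at 0.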
Shifts in Data:
E(X ± c) = E(X) ± c
VAR(X ± c) = VAR(X)
E(aX) = aE(X)
VAR(aX) = a^2 VAR(X)
E(X ± Y) = E(X) ± E(Y)
VAR(X ± Y) = VAR(X) + VAR(Y), X and Y must be independent
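These rules can be checked exactly on a small model. The sketch below uses a fair six-sided die as a hypothetical example and Python fractions so there is no rounding; the helper names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

def expectation(model):
    return sum(x * p for x, p in model.items())

def variance(model):
    mu = expectation(model)
    return sum((x - mu) ** 2 * p for x, p in model.items())

def transform(model, f):
    """Probability model of f(X), given the model of X."""
    out = {}
    for x, p in model.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}  # hypothetical fair die

# Difference of two independent dice, built from all 36 equally likely pairs
diff = {}
for x, y in product(die, repeat=2):
    diff[x - y] = diff.get(x - y, 0) + die[x] * die[y]

shifted = transform(die, lambda x: x + 10)   # X + c
scaled = transform(die, lambda x: 3 * x)     # aX

print(variance(die), variance(shifted), variance(scaled), variance(diff))
```

Note that the variance of the difference is the sum of the variances, not the difference: subtracting an independent variable still adds variability.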
Is your situation a Bernoulli trial?
Bernoulli trials: 1) have only 2 possible outcomes
2) have a constant probability of success
3) have independent trials
You can use two models!
Geometric: how many trials it takes to achieve the first success; order does matter
μ = 1/p
SD = sqrt(q/p^2)
P(X = x) = q^(x-1) * p
Binomial: the number of successes in n trials; order doesn't matter
μ = np
SD = sqrt(npq)
P(X = x) = (n!/(x!(n-x)!)) * p^x * q^(n-x)
2nd DISTR --> geometpdf(p, x) - ex. probability of finding a picture in the 5th box
geometcdf(p, x) - ex. probability of finding first success on or before the xth trial
binompdf(n,p,x) - ex. find probability of finding a card exactly twice in 5 boxes
binomcdf(n,p,x) - total probability of getting x or fewer successes among n trials
A binomial model is approx. Normal if np >= 10 and nq >= 10
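For reference, here is a sketch of what the four calculator functions compute, written in Python. The function names copy the calculator's; they are not Python built-ins, and p = 0.2 is a made-up value for the picture-in-a-box example.

```python
from math import comb

def geometpdf(p, x):
    """P(first success on trial x)."""
    return (1 - p) ** (x - 1) * p

def geometcdf(p, x):
    """P(first success on or before trial x)."""
    return 1 - (1 - p) ** x

def binompdf(n, p, x):
    """P(exactly x successes in n trials)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(x or fewer successes in n trials)."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

p = 0.2  # hypothetical chance that any one box holds the picture
print(geometpdf(p, 5))      # first picture found in the 5th box
print(geometcdf(p, 5))      # first picture found in one of the first 5 boxes
print(binompdf(5, p, 2))    # exactly 2 pictures in 5 boxes
print(binomcdf(5, p, 2))    # 2 or fewer pictures in 5 boxes
```

The cdf versions are just running totals of the pdf versions, which is why binomcdf is written as a sum.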
From the hypothetical to the real
Sampling Distribution Models
SD (p-hat) = sqrt(pq/n)
center at p
1) 10% condition
2) Success/failure condition
Central Limit Theorem: As the sample size n increases, the mean of n independent values has a sampling distribution that tends toward a Normal model with mean μ(y-bar) = μ, the population mean, and SD(y-bar) = σ/sqrt(n), where σ is the population standard deviation
Note 3 conditions for CLT: 1) Random sampling condition
2) Independence assumption
3) 10% condition
Also note: the estimated standard deviation of a sampling distribution is called the standard error (SE)
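A short simulation can make the Central Limit Theorem concrete. This Python sketch draws many samples from a hypothetical population (uniform on 0 to 1, which is not Normal at all) and compares the spread of the sample means with σ/sqrt(n).

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)

n = 25              # sample size
samples = 5_000     # number of samples drawn

# Hypothetical population: uniform on [0, 1), with mean 0.5 and SD sqrt(1/12)
pop_mean = 0.5
pop_sd = sqrt(1 / 12)

# One sample mean per simulated sample
sample_means = [sum(random.random() for _ in range(n)) / n for _ in range(samples)]

print(f"mean of sample means: {mean(sample_means):.4f} (population mean {pop_mean})")
print(f"SD of sample means:   {stdev(sample_means):.4f} (sigma/sqrt(n) = {pop_sd / sqrt(n):.4f})")
```

The two printed pairs should agree closely, and a histogram of sample_means would look roughly Normal even though the population is flat.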
How do we determine a range for the real proportion, and how confident are we that the real proportion is within this interval?
Confidence Interval = p-hat +/- ME
ME = z* x SE(p-hat)
Must meet conditions:
3) 10% condition
4) Success/failure condition
On the calculator, you can use 1-PropZInt to determine the CI: STAT -> TESTS -> A
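Roughly what 1-PropZInt does can be sketched in Python. The sample counts are made up, the function name is hypothetical, and SE(p-hat) is taken to be sqrt(p-hat x q-hat / n), the usual formula when p is unknown.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Confidence interval for a proportion: p-hat +/- z* x SE(p-hat)."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)                    # SE(p-hat)
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)   # critical value z*
    me = z_star * se                                      # margin of error
    return p_hat - me, p_hat + me

# Hypothetical sample: 520 successes out of 1000
low, high = one_prop_z_interval(520, 1000)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

Raising the confidence level makes z* bigger and the interval wider; a bigger sample shrinks SE and makes it narrower.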
Using these to support hypotheses
H0 (AKA null hypothesis): proportion of ___ has not changed (p = p0)
Ha (AKA alternative hypothesis): proportion of ___ is ≠ or > or < the original p0
SD(p-hat) = sqrt(p0 x q0 / n), using the null hypothesis value p0 (q0 = 1 - p0)
Use one-proportion z-test
z = (p-hat - p0) / SD(p-hat)
2) Random sampling
3) 10% condition
4) Success/failure condition
np and nq both at least 10
significance level generally equals 0.05 or 0.01
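A sketch of the one-proportion z-test in Python, with made-up data. The function name and the "alternative" argument are hypothetical; the P-value is the Normal-model area at least as extreme as the observed z.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, P-value) for H0: p = p0."""
    p_hat = successes / n
    sd = sqrt(p0 * (1 - p0) / n)   # SD(p-hat), computed from the null value p0
    z = (p_hat - p0) / sd
    upper_tail = 1 - NormalDist().cdf(z)
    if alternative == "greater":       # Ha: p > p0
        p_value = upper_tail
    elif alternative == "less":        # Ha: p < p0
        p_value = 1 - upper_tail
    else:                              # Ha: p is not equal to p0
        p_value = 2 * min(upper_tail, 1 - upper_tail)
    return z, p_value

# Hypothetical data: 560 successes in 1000 trials, testing H0: p = 0.50
z, p_value = one_prop_z_test(560, 1000, 0.50)
alpha = 0.05
print(f"z = {z:.2f}, P-value = {p_value:.5f}, reject H0: {p_value < alpha}")
```

If the P-value is below the significance level, reject H0; otherwise, fail to reject it.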
Type I Error: H0 is true but you reject it
Type II Error: H0 is false but you fail to reject it
Power: the probability of correctly rejecting a false null hypothesis
Effect size: distance between the null hypothesis value p0 and the true value p
Reduce Type I and Type II errors by increasing sample size
For a fixed sample size, reducing the chance of a Type I error will increase the chance of a Type II error