Statistics

by Julia Liu on 4 April 2014


Transcript of Statistics

Statistics
Area Principle
Simpson's Paradox
Representatives who voted for the Civil Rights Act of 1964

House       Democrat         Republican
Northern    94% (145/154)    85% (138/162)
Southern     7% (7/94)        0% (0/10)
Both        61% (152/248)    80% (138/172)
Why are Republicans overall more supportive when the Democrats in each region were more supportive?
Republicans were concentrated in the North (162 of 172), and Northern representatives of both parties were far more supportive of the bill. Southern representatives, who mostly voted against it, made up a large share of the Democrats (94 of 248).
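The reversal in the table can be checked with a few lines of Python. This is a minimal sketch: the counts come from the table, while the dictionary layout and the `rate` helper are illustrative choices, not part of the original slides.

```python
# (yes votes, group size) taken from the table
votes = {
    "Northern": {"Democrat": (145, 154), "Republican": (138, 162)},
    "Southern": {"Democrat": (7, 94), "Republican": (0, 10)},
}

def rate(yes, total):
    """Fraction of a group that voted yes."""
    return yes / total

# Within each region, Democrats have the higher rate.
for region, parties in votes.items():
    dem = rate(*parties["Democrat"])
    rep = rate(*parties["Republican"])
    print(f"{region}: Democrat {dem:.0%}, Republican {rep:.0%}")

# Pooling the regions reverses the comparison.
overall = {}
for party in ("Democrat", "Republican"):
    yes = sum(votes[region][party][0] for region in votes)
    total = sum(votes[region][party][1] for region in votes)
    overall[party] = rate(yes, total)
    print(f"Both: {party} {overall[party]:.0%} ({yes}/{total})")
```

The pooled rates weight each region by how many members a party had there, which is why the combined row can point the other way.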
Fences of Boxplots
Upper fence = Q3 + 1.5(IQR)
Lower fence = Q1 - 1.5(IQR)
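A small sketch of the fence formulas in Python. The data values are made up, and the quartile convention (medians of the lower and upper halves) is an assumption; other software may compute Q1 and Q3 slightly differently.

```python
import statistics

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves
    (the middle value is left out when n is odd). Assumed convention."""
    s = sorted(data)
    half = len(s) // 2
    return statistics.median(s[:half]), statistics.median(s[-half:])

def fences(data):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]   # made-up values
lower, upper = fences(data)
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)
```

Points outside the fences are the ones a boxplot draws individually as suspected outliers.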
TI Tips
For 5 Number Summary:
STAT CALC --> 1-VAR STATS --> ENTER
1 VAR STATS L1 --> ENTER
Standard Deviation
Where "mu" stands for the mean and "sigma" stands for the standard deviation
Shifting and Rescaling Data
68-95-99.7 Rule
TI Tips
2nd DISTR--> normalcdf--> ENTER
normalcdf (low z score, high z score)--> ENTER = Area of normal model
2nd DISTR--> invNorm( --> ENTER
invNorm(Percentile) --> ENTER = Cut point for percentile
Example Problem
A score at or above the 90th percentile is how many standard deviations from the mean?
Answer: 1.28 SD's or more above the mean
Example Problem
Test scores are normally distributed with a mean of 72 and a standard deviation of 4.2. What is the percentile of a score of 65 (round to the nearest whole percentile)?
Answer: 5th percentile
Example Problem
Given that 10% of the nails made by a manufacturer have a length less than 2.48 in., while 5% have a length greater than 2.54 in., what are the mean and standard deviation of the lengths of the nails? Assume that the lengths have a normal distribution.
Answer:
Mean= 2.506
SD= 0.0205
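The three example problems can be reproduced without a calculator. This sketch uses `NormalDist` from Python's standard `statistics` module, with `cdf` playing the role of normalcdf and `inv_cdf` the role of invNorm; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal model N(0, 1)

# 90th percentile cut point, like invNorm(0.90)
z90 = z.inv_cdf(0.90)

# Percentile of a score of 65 when scores follow N(72, 4.2)
pct_65 = NormalDist(mu=72, sigma=4.2).cdf(65)

# Nails: 10% are shorter than 2.48 in. and 5% are longer than 2.54 in.
# So 2.48 = mean + z_low * sd and 2.54 = mean + z_high * sd.
z_low = z.inv_cdf(0.10)
z_high = z.inv_cdf(0.95)
sd = (2.54 - 2.48) / (z_high - z_low)
mean = 2.48 - z_low * sd

print(round(z90, 2), round(pct_65 * 100), round(mean, 3), round(sd, 4))
```

The nails problem is two equations in two unknowns: subtracting them removes the mean and leaves the standard deviation.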
Scatterplots
Note:
Direction
Form
Scatter/strength
Outliers
Correlation Conditions
Quantitative variables condition
Straight enough condition
Outlier condition
Correlation Properties
Sign of correlation coefficient gives direction of association
Correlation of x with y is same as correlation of y with x
Correlation has no units
Correlation not affected by changes in center or scale of either variable
Correlation is sensitive to outliers
And...
Correlation does not prove causation!
Answer: B
Answer: D
Time for Mr. Whiskers!
Linear Regressions!
Linear Regressions
Residual: Difference between observed value and predicted value
Line of best fit: line for which sum of squared residuals is smallest
Linear Regressions
Regression Equation: y-hat = a + bx
Slope of regression line:
b = r(sy/sx), where sx and sy are the standard deviations of x and y
Slope of best fit line for z-scores is the correlation, r
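A sketch of the slope formula on made-up data. The intercept line is an added assumption not shown on the slide: it uses the fact that the least-squares line passes through the point (mean of x, mean of y).

```python
import math

def summarize(xs, ys):
    """Return the means, sample standard deviations, and correlation r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # r is the sum of products of z-scores divided by n - 1
    r = sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
            for x, y in zip(xs, ys)) / (n - 1)
    return mean_x, mean_y, sx, sy, r

xs = [1, 2, 3, 4, 5]               # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_x, mean_y, sx, sy, r = summarize(xs, ys)
b = r * sy / sx                    # slope: b = r(sy/sx)
a = mean_y - b * mean_x            # assumed: line passes through the means
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}")
```

The residuals from this line sum to zero (up to rounding), which is a quick sanity check on the fit.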
Types of Variables
Categorical
Quantitative
In a skewed distribution, the mean is pulled toward the tail
Remember SOCS!
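Adding/Subtracting a constant
Center and Percentiles: shift by that constant
Spread: unchanged
Multiplying/Dividing by a constant
Center: multiplied/divided by that constant
Percentiles: multiplied/divided by that constant
Spread: multiplied/divided by that constant (by its absolute value if the constant is negative)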
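A quick check of shifting and rescaling on made-up scores. The `summary` helper is illustrative; it reports the mean and median as measures of center and the standard deviation as the measure of spread.

```python
import statistics

data = [60, 70, 75, 80, 95]          # made-up scores
shifted = [x + 10 for x in data]     # add a constant
rescaled = [x * 2 for x in data]     # multiply by a constant

def summary(values):
    """Center (mean, median) and spread (standard deviation) of a list."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))

mean0, median0, sd0 = summary(data)
mean1, median1, sd1 = summary(shifted)
mean2, median2, sd2 = summary(rescaled)

print("original:", mean0, median0, round(sd0, 2))
print("shifted: ", mean1, median1, round(sd1, 2))   # center moves, spread stays
print("rescaled:", mean2, median2, round(sd2, 2))   # center and spread both double
```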
5 Number Summary
Max
Q3
Median
Q1
Min
Skewed: Median & IQR
Symmetric: Mean & SD
Describing Distribution
Normal Model: N(μ, σ)
We know our variables. What's their relationship?
Prediction
Categorical Data
Beware the...
Quantitative Data
r^2 = the percentage of variability in y that is explained by x
TI Tips
STAT CALC -> LinReg (a+bx) -> ENTER
See equation, r, r^2
For residuals plot:
Go to STATPLOT -> x list: L1, y list: RESID -> Turn plot on, zoom stat
Beware...
Don't fit a straight line to a non-linear relationship
Extraordinary/influential points
Extrapolation
Lurking variables
Randomness
Woah woah woah, hold up! Before all of this, we need the data!
Where it all begins...
Simulation
1) Assign digits
2) Define trial
3) Define success
4) Run simulation
5) Analyze results
6) State conclusion
TI Tips
To seed (ex. seed 5): enter 5, press STO, then MATH, then PRB, then rand, hit ENTER
MATH, PRB, randInt(0,9,5) etc., ENTER
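The six simulation steps can be sketched in Python. The scenario is hypothetical (an event that succeeds 70% of the time, attempted three times), and the seed value and trial count are arbitrary choices.

```python
import random

rng = random.Random(5)   # seeded so the run can be repeated

def run_trial():
    """One trial: three attempts.
    1) Assign digits: 0-6 mean success (70%), 7-9 mean failure.
    2) Define trial: draw three random digits.
    3) Define success: at least one of the three attempts succeeds."""
    digits = [rng.randint(0, 9) for _ in range(3)]   # like randInt(0,9,3)
    return any(d <= 6 for d in digits)

# 4) Run the simulation many times.
trials = 10_000
successes = sum(run_trial() for _ in range(trials))

# 5) Analyze results and 6) state the conclusion.
estimate = successes / trials
print(f"Estimated P(at least one success in 3 attempts) = {estimate:.3f}")
```

With this many trials the estimate should land close to the exact value, 1 - 0.3^3 = 0.973.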
Probability
Addition Rule: P(A ∪ B) = P(A) + P(B), A and B must be disjoint
Multiplication Rule: P(A ∩ B) = P(A) * P(B), A and B must be independent
Practice Problem:
A certain bowler can bowl a strike 70% of the time. What is the probability that she:
a) goes 3 consecutive frames without a strike
b) makes her first strike in the 3rd frame?
c) has at least one strike in the first 3 frames?
d) bowls a perfect game (12 consecutive strikes)?
Answers:
a) 0.027
b) 0.063
c) 0.973
d) 0.0138
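The four answers follow from the multiplication rule and the complement rule. This sketch assumes each frame is independent of the others; the variable names are illustrative.

```python
p = 0.70          # probability of a strike in any frame
q = 1 - p         # probability of no strike

no_strike_in_3 = q ** 3            # a) miss, miss, miss
first_strike_on_3rd = q ** 2 * p   # b) miss, miss, strike
at_least_one_in_3 = 1 - q ** 3     # c) complement of (a)
perfect_game = p ** 12             # d) twelve strikes in a row

print(round(no_strike_in_3, 3), round(first_strike_on_3rd, 3),
      round(at_least_one_in_3, 3), round(perfect_game, 4))
```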
Now to Probability Rules
General Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Computer Regression Analysis
Make Distribution more symmetric
Make the spread of several groups more alike
Make the form of a scatterplot more linear
The Power of Re-Expression
Ladder of Powers
Power         Use For
2             Unimodal distributions
1             No change
1/2           Counts
"0" or log    Non-zero measurements
-1/2          -----
-1            Ratios
Sample Surveys
Randomize to protect against effects we aren't aware of
Types of Sampling Designs
Simple Random Sample
Stratified
Cluster
Multistage
Systematic
Beware of:
Non-response bias
Response bias
Undercoverage Bias
Calling at 2 PM
P(B|A) = P(A ∩ B) / P(A)
If independent, P(B|A) = P(B)
We can use the various sampling methods in an experiment.
Well-designed experiments can establish a cause-and-effect relationship between two variables.
4 Principles of Experimental Design
1) Control
2) Randomize
3) Replicate
4) Block- remove variability due to difference among blocks
Those lurking variables...
Common response- both the explanatory and response variables are influenced by a 3rd variable
Confounding- when 2 variables are entwined in such a way that you can't tell which one is having an effect on the response variable
Part 2
General Multiplication Rule: P(A ∩ B) = P(A) x P(B|A)
Example Problem:
Answer: E, use P(California | Phone)
Okay, I know how to find my probabilities. What's next?
Building a probability model to calculate expected values and standard deviations!
Expected Value: E(X) = μ = Σ x * P(X=x)
Example: E(X) = 10,000(1/1000) + 5000(2/1000) + 0(997/1000) = 20
Variance:
VAR(X) = (SD(X))^2 = Σ (x - μ)^2 * P(X=x)
Standard deviation:
SD(X) = sqrt(VAR(X))
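A sketch that carries the example through to the standard deviation. The payouts and probabilities come from the example; storing the model as a dictionary is an illustrative choice.

```python
import math

# Probability model from the example: outcome -> probability
model = {10_000: 1 / 1000, 5_000: 2 / 1000, 0: 997 / 1000}

# Expected value: sum of each outcome times its probability
expected = sum(x * p for x, p in model.items())

# Variance: sum of squared deviations from the mean, weighted by probability
variance = sum((x - expected) ** 2 * p for x, p in model.items())
sd = math.sqrt(variance)

print(f"E(X) = {expected:.2f}, VAR(X) = {variance:.0f}, SD(X) = {sd:.2f}")
```

The standard deviation is large compared with the expected value because almost all outcomes are 0 while a few are very large.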
Shifts in Data:
E(X ± c) = E(X) ± c
VAR(X ± c) = VAR(X)
E(aX) = aE(X)
VAR(aX) = a^2 VAR(X)
E(X ± Y) = E(X) ± E(Y)
VAR(X ± Y) = VAR(X) + VAR(Y), X and Y must be independent
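These rules can be verified exactly on a small example. The sketch below assumes two independent fair dice (not from the slides) and uses exact fractions so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)          # one fair die (assumed example)
p_face = Fraction(1, 6)

def mean_var(outcomes):
    """Mean and variance of a list of (value, probability) pairs."""
    mu = sum(v * p for v, p in outcomes)
    var = sum((v - mu) ** 2 * p for v, p in outcomes)
    return mu, var

die = [(x, p_face) for x in faces]
doubled = [(2 * x, p_face) for x in faces]
# Two independent dice: each (x, y) pair has probability 1/36
total = [(x + y, p_face * p_face) for x, y in product(faces, faces)]
diff = [(x - y, p_face * p_face) for x, y in product(faces, faces)]

mu_x, var_x = mean_var(die)
mu_2x, var_2x = mean_var(doubled)
mu_sum, var_sum = mean_var(total)
mu_diff, var_diff = mean_var(diff)

print(var_x, var_2x, var_sum, var_diff)
```

Note that the variance of the difference is the sum of the variances, not the difference: subtracting one die from another still adds uncertainty.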
Is your situation a Bernoulli trial?
Bernoulli trials: 1) have only 2 possible outcomes
2) have a constant probability of success
3) have independent trials
Yes?
You can use two models!
Geometric- to find how long it will take to achieve a success, order does matter

P(X=x) = q^(x-1) * p
μ = 1/p
SD = sqrt(q/p^2)
Binomial- interested in the number of successes, order doesn't matter

P(X=x) = (n!/(x!(n-x)!)) * p^x * q^(n-x)
μ = np
SD = sqrt(npq)
(p = probability of success, q = 1 - p)
TI Tips

2nd DISTR geometpdf (p,x) - ex. probability of finding a picture in the 5th box
geometcdf(p, x) - ex. probability of finding first success on or before the xth trial
binompdf(n,p,x) - ex. find probability of finding a card exactly twice in 5 boxes
binomcdf(n,p,x) - total probability of getting x or fewer successes among n trials
A binomial model is approx. Normal if np >= 10 and nq >= 10
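The geometric and binomial formulas translate directly into code. The functions below are illustrative Python stand-ins named after the calculator commands, not the calculator's own implementation, and the value p = 0.2 is an assumed chance that a box contains the picture.

```python
from math import comb

def geometpdf(p, x):
    """P(first success on trial x)."""
    return (1 - p) ** (x - 1) * p

def geometcdf(p, x):
    """P(first success on or before trial x)."""
    return 1 - (1 - p) ** x

def binompdf(n, p, x):
    """P(exactly x successes in n trials)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(x or fewer successes in n trials)."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

p = 0.2   # assumed probability of success on each trial
print(geometpdf(p, 5))      # first picture in the 5th box
print(geometcdf(p, 5))      # first picture within the first 5 boxes
print(binompdf(5, p, 2))    # exactly 2 pictures in 5 boxes
print(binomcdf(5, p, 2))    # 2 or fewer pictures in 5 boxes
```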
From the hypothetical to the real
Sampling Distribution Models
Proportions:
center at p
SD(p-hat) = sqrt(pq/n)
Conditions:
1) 10% condition
2) Success/failure condition
Central Limit Theorem: As the sample size n increases, the mean of n independent values has a sampling distribution that tends toward a Normal model with mean μ(y-bar) = μ, the population mean, and SD(y-bar) = σ/sqrt(n), where σ is the population standard deviation
Note 3 conditions for CLT: 1) Random sampling condition
2) Independence assumption
3) 10% condition
Also note: when the standard deviation of a sampling distribution is estimated from the sample data, it is called the standard error
How do we determine a range for the real proportion, and how confident are we that the real proportion is within this interval?
Confidence Interval = p-hat +/- ME
ME = z* x SE(p hat)

Must meet conditions:
1) Independence
2) Random
3) 10% condition
4) Success/failure condition

In the calculator, you can use 1-PropZInt to determine the CI: STAT -> TESTS -> A
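A sketch of the same interval in Python. The sample (520 successes out of 1000) is made up, the function name is illustrative, and the code assumes the conditions listed above have already been checked. It uses SE(p-hat) = sqrt(p-hat * q-hat / n), with q-hat = 1 - p-hat.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Confidence interval for a proportion: p-hat +/- z* x SE(p-hat)."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)                       # standard error
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    me = z_star * se                                         # margin of error
    return p_hat - me, p_hat + me

low, high = one_prop_z_interval(520, 1000)   # made-up sample
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

Raising the confidence level increases z* and widens the interval; a larger sample shrinks the standard error and narrows it.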
Using these to support hypotheses
H0 (AKA null hypothesis) - proportion of ___ has not changed

Ha (AKA alternative hypothesis) - proportion of ___ ≠ or > or < original p

SD(p-hat) = sqrt(p0*q0/n), where p0 is the hypothesized proportion and q0 = 1 - p0
Use one-proportion z-test
z = (p-hat - p0) / SD(p-hat)

4 conditions

1) Independence
2) Random sampling
3) 10% condition
4) Success/failure condition
np0 and nq0 both at least 10

significance level generally equals 0.05 or 0.01
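A sketch of the one-proportion z-test in Python. The sample (560 successes out of 1000, tested against p0 = 0.5) is made up, the function name and the `alternative` argument are illustrative, and the code assumes the conditions have already been checked. It also reports a P-value, which is compared with the significance level to decide whether to reject the null hypothesis.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, P-value) for the null hypothesis p = p0."""
    p_hat = successes / n
    sd = sqrt(p0 * (1 - p0) / n)       # SD(p-hat) computed from p0
    z = (p_hat - p0) / sd
    std_normal = NormalDist()
    if alternative == "greater":       # Ha: p > p0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":        # Ha: p < p0
        p_value = std_normal.cdf(z)
    else:                              # Ha: p is not equal to p0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value

z, p_value = one_prop_z_test(560, 1000, p0=0.5)   # made-up sample
reject_null = p_value < 0.05                      # significance level 0.05
print(f"z = {z:.2f}, P-value = {p_value:.5f}, reject: {reject_null}")
```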
Errors:
Type I Error - H0 is true but you reject it
Type II Error - H0 is false but you fail to reject it

Power: the probability of correctly rejecting a false null hypothesis
Effect size- distance between the null hypothesis value p0 and the true value p
Reduce both Type I and Type II errors at once by increasing sample size
For a fixed sample size, reducing Type I errors will generally increase Type II errors
Part 3