Area Principle

Simpson's Paradox

Representatives who voted for the Civil Rights Act of 1964

House      Democrat         Republican
Northern   94% (145/154)    85% (138/162)
Southern   7% (7/94)        0% (0/10)
Both       61% (152/248)    80% (138/172)

Why are Republicans more supportive overall when Democrats were more supportive in each region?

Republicans were more concentrated in the North, and Northern representatives were far more supportive of the bill.
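
The reversal can be checked directly from the counts in the table. The sketch below is only an illustration; the helper name `support_rate` is made up for this example.

```python
# Yes votes and total members, copied from the table above.
votes = {
    ("Northern", "Democrat"): (145, 154),
    ("Northern", "Republican"): (138, 162),
    ("Southern", "Democrat"): (7, 94),
    ("Southern", "Republican"): (0, 10),
}

def support_rate(party, region=None):
    """Share of a party voting yes, within one region or overall (region=None)."""
    yes = total = 0
    for (reg, par), (y, n) in votes.items():
        if par == party and region in (None, reg):
            yes += y
            total += n
    return yes / total

for region in ("Northern", "Southern", None):
    label = region or "Both"
    dem = support_rate("Democrat", region)
    rep = support_rate("Republican", region)
    print(f"{label:9s} Democrat {dem:.0%}  Republican {rep:.0%}")

# Share of each party's members who held Northern seats.
north_share_dem = 154 / (154 + 94)
north_share_rep = 162 / (162 + 10)
print(f"Northern share: Democrat {north_share_dem:.0%}, Republican {north_share_rep:.0%}")
```

Democrats lead within each region, but most Republicans sat in the high-support North, so the combined Republican rate comes out higher.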

Fences of Boxplots

Upper fence = Q3 + 1.5(IQR)

Lower fence = Q1 - 1.5(IQR)
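
As a rough sketch, the two fence formulas can be turned into a small outlier check. The quartiles and data values below are hypothetical, and the function names are made up for this example.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the two quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(values, q1, q3):
    """Values that fall outside the fences."""
    lower, upper = fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Hypothetical quartiles and data, for illustration only.
lower, upper = fences(q1=10, q3=20)
print(lower, upper)
print(outliers([2, 9, 14, 21, 36], q1=10, q3=20))
```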

TI Tips

For 5 Number Summary:

STAT CALC --> 1-VAR STATS --> ENTER

1 VAR STATS L1 --> ENTER

Standard Deviation

Where "mu" stands for the mean and "sigma" stands for the standard deviation

Shifting and Rescaling Data

68-95-99.7 Rule
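
The rule says that in a Normal model about 68%, 95%, and 99.7% of values lie within 1, 2, and 3 standard deviations of the mean. A quick way to check those figures, assuming Python's standard `statistics.NormalDist` is available (the helper `area_within` is made up for this example):

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal model, N(0, 1)

def area_within(k):
    """Area of the Normal model within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.1%}")
```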

TI Tips

2nd DISTR--> normalcdf--> ENTER

normalcdf (low z score, high z score)--> ENTER = Area of normal model

2nd DISTR--> invNorm( --> ENTER

invNorm(Percentile) --> ENTER = Cut point for percentile

Example Problem

A score at or above the 90th percentile is how many standard deviations from the mean?

Answer: 1.28 SD's or more above the mean

Example Problem

Test scores are normally distributed with a mean of 72 and a standard deviation of 4.2 . What is the percentile of a score of 65 (round to the nearest whole percentile)?

Answer: 5th percentile
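
Both example answers above can be reproduced off the calculator. This is a sketch using Python's `statistics.NormalDist`, where `inv_cdf` plays the role of invNorm and `cdf` the role of normalcdf with a very low lower bound.

```python
from statistics import NormalDist

# Example 1: z-score that cuts off the 90th percentile.
z_cut = NormalDist().inv_cdf(0.90)

# Example 2: percentile of a score of 65 under N(72, 4.2).
scores = NormalDist(mu=72, sigma=4.2)
pct = scores.cdf(65)

print(round(z_cut, 2), round(pct * 100))
```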

Example Problem

Given that 10% of the nails made by a manufacturer have a length less than 2.48 in., while 5% have a length greater than 2.54 in., what are the mean and standard deviation of the lengths of the nails? Assume that the lengths have a normal distribution.

Answer:

Mean= 2.506

SD= 0.0205
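
One way to reach this answer, sketched in Python: convert each percentage to a z-score, then solve the two equations 2.48 = mean + z_low * SD and 2.54 = mean + z_high * SD.

```python
from statistics import NormalDist

std = NormalDist()
z_low = std.inv_cdf(0.10)    # 10% of nails are shorter than 2.48 in.
z_high = std.inv_cdf(0.95)   # 5% are longer than 2.54 in., so 95% are shorter

# Subtracting the two equations eliminates the mean.
sd = (2.54 - 2.48) / (z_high - z_low)
mean = 2.48 - z_low * sd

print(round(mean, 3), round(sd, 4))
```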

Scatterplots

Note:

Direction

Form

Scatter/strength

Outliers

Correlation Conditions

Quantitative variables condition

Straight enough condition

Outlier condition

Correlation Properties

Sign of correlation coefficient gives direction of association

Correlation of x with y is same as correlation of y with x

Correlation has no units

Correlation not affected by changes in center or scale of either variable

Correlation is sensitive to outliers

And...

Correlation does not prove causation!

Answer: B

Answer: D

Time for Mr. Whiskers!

Linear Regressions!

Linear Regressions

Residual: Difference between observed value and predicted value

Line of best fit: line for which sum of squared residuals is smallest

Regression equation: y-hat = a + bx

Slope of regression line:

b = r(sy/sx)

Slope of best fit line for z-scores is the correlation, r
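
A minimal sketch of these formulas on made-up data. The data values and function names are hypothetical, and the intercept formula a = y-bar - b(x-bar) is not stated on the slide; it follows from the fact that the least-squares line passes through (x-bar, y-bar).

```python
from statistics import mean, stdev

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def correlation(xs, ys):
    """r = sum of the products of paired z-scores, divided by n - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    zsum = sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys))
    return zsum / (len(xs) - 1)

r = correlation(x, y)
b = r * stdev(y) / stdev(x)   # slope: b = r(sy/sx)
a = mean(y) - b * mean(x)     # intercept (assumption noted above)

def predict(x_new):
    """Predicted value y-hat = a + bx."""
    return a + b * x_new

# Residual = observed value - predicted value
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
print(round(a, 2), round(b, 2), round(r ** 2, 3))
```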

Types of Variables

Categorical

Quantitative

Mean is pulled toward the tail

Remember SOCS!

Adding/Subtracting

Center and Percentiles

Spread

Multiplying/Dividing

Center

Percentiles

Spread

5 Number Summary

Max

Q3

Median

Q1

Min

Skewed: Median & IQR

Symmetric: Mean & SD

Describing Distribution

Normal Model: N(mu, sigma)

We know our variables. What's their relationship?

Prediction

Categorical Data

Beware the...

Quantitative Data

r^2 = the fraction of the variability in y that is accounted for by the linear model on x

TI Tips

STAT CALC -> LinReg (a+bx) -> ENTER

See equation, r, r^2

For residuals plot:

Go to STATPLOT -> x list: L1, y list: RESID -> Turn plot on, zoom stat

Beware...

Don't fit a straight line to a non-linear relationship

Extraordinary/influential points

Extrapolation

Lurking variables

Randomness

**Woah woah woah, hold up! Before all of this, we need the data!**

**Where it all begins...**

Simulation

1) Assign digits

2) Define trial

3) Define success

4) Run simulation

5) Analyze results

6) State conclusion
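
The six steps can be sketched in code. The scenario here is hypothetical (an event with a 70% chance of success, attempted three times), and the seed value is arbitrary.

```python
import random

rng = random.Random(5)  # seeded so the run can be repeated

def run_trial():
    # 1) Assign digits: 0-6 mean success (70%), 7-9 mean failure.
    # 2) Define trial: draw three digits, one per attempt.
    digits = [rng.randint(0, 9) for _ in range(3)]
    # 3) Define success: at least one of the three attempts succeeds.
    return any(d <= 6 for d in digits)

# 4) Run simulation
n_trials = 10_000
successes = sum(run_trial() for _ in range(n_trials))

# 5) Analyze results
estimate = successes / n_trials
print(f"Estimated probability: {estimate:.3f}")

# 6) State conclusion: report the estimate in the context of the question.
```

More trials give a more stable estimate; a handful of trials can land far from the true probability.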

TI Tips

To seed (ex. seed 5): enter 5, press STO>, then MATH, then PRB, then rand, hit ENTER

MATH PRB, randInt(0,9, 5) etc. , ENTER

Probability

Addition Rule: P(A ∪ B) = P(A) + P(B), A and B must be disjoint

Multiplication Rule: P(A ∩ B) = P(A) * P(B), A and B must be independent

Practice Problem:

A certain bowler can bowl a strike 70% of the time. What is the probability that she:

a) goes 3 consecutive frames without a strike

b) makes her first strike in the 3rd frame?

c) has at least one strike in the first 3 frames?

d) bowls a perfect game (12 consecutive strikes)?

Answers:

a) 0.027

b) 0.063

c) 0.973

d) 0.0138
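
The four answers follow from the multiplication rule and the complement rule, assuming the frames are independent. A short check, with variable names made up for this example:

```python
p = 0.70   # chance of a strike in any one frame
q = 1 - p  # chance of no strike

no_strike_3 = q ** 3              # a) miss, miss, miss
first_strike_in_3rd = q ** 2 * p  # b) miss, miss, strike
at_least_one_in_3 = 1 - q ** 3    # c) complement of part a
perfect_game = p ** 12            # d) twelve strikes in a row

print(round(no_strike_3, 3), round(first_strike_in_3rd, 3),
      round(at_least_one_in_3, 3), round(perfect_game, 4))
```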

Now to Probability Rules

Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Computer Regression Analysis

Make Distribution more symmetric

Make the spread of several groups more alike

Make the form of a scatterplot more linear

The Power of Re-Expression

Ladder of Powers

Power         Use For
2             Unimodal distributions
1             No change
1/2           Counts
"0" or log    Non-zero measurements
-1/2          -----
-1            Ratios
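
A small sketch of moving along the ladder of powers. The function name `reexpress` and the data are hypothetical. Power "0" is treated as log base 10, and the log and negative powers assume positive values.

```python
import math

def reexpress(values, power):
    """Apply one rung of the ladder of powers; power 0 means log base 10."""
    if power == 0:
        return [math.log10(v) for v in values]
    return [v ** power for v in values]

# Hypothetical right-skewed measurements, for illustration only.
data = [1, 10, 100, 1000]
print(reexpress(data, 0.5))  # square root
print(reexpress(data, 0))    # log
print(reexpress(data, -1))   # reciprocal
```

On this data the log re-expression spaces the values evenly, which is the kind of straightening or symmetrizing the ladder is used for.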

Sample Surveys

Randomize to protect against effects we aren't aware of

Types of Sampling Designs

Simple Random Sample

Stratified

Cluster

Multistage

Systematic

Beware of:

Non-response bias

Response bias

Undercoverage Bias

Calling at 2 PM

P(B|A) = P(A ∩ B) / P(A)

If independent, P(B|A) = P(B)

We can use the various sampling methods in an experiment.

Experiments can establish a cause-and-effect relationship between two variables.

4 Principles of Experimental Design

1) Control

2) Randomize

3) Replicate

4) Block- remove variability due to difference among blocks

Those lurking variables...

Common response: both the explanatory and the response variable are influenced by a third variable

Confounding: two variables are entwined in such a way that you can't tell which one is having an effect on the response variable

**Part 2**

General Multiplication Rule

P(A ∩ B) = P(A) x P(B|A)
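
A quick illustration of the general multiplication rule with a hypothetical example that is not from the slides: drawing two cards without replacement and asking for two aces. Exact fractions avoid rounding.

```python
from fractions import Fraction

# Hypothetical example: draw two cards without replacement.
p_a = Fraction(4, 52)          # P(A): first card is an ace
p_b_given_a = Fraction(3, 51)  # P(B|A): second is an ace, given the first was

p_a_and_b = p_a * p_b_given_a  # P(A and B) = P(A) x P(B|A)

# Rearranged, the same rule gives the conditional probability back.
recovered = p_a_and_b / p_a

print(p_a_and_b, recovered)
```

The draws are not independent here, which is why P(B|A) is used in place of P(B).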

Example Problem:

Answer: E, use P(California | Phone)

Okay, I know how to find my probabilities. What's next?

Building a probability model to calculate expected values and standard deviations!

Expected Value: E(X) = mu = sum of x * P(X=x) over all values x

Example: E(X) = 10,000(1/1000) + 5000(2/1000) + 0(997/1000) = 20

Standard deviation:

VAR(X) = (SD(X))^2 = sum of (x - mu)^2 * P(X=x) over all values x

SD(X) = sqrt(VAR(X))
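
As a sketch, the example model above can be run through both formulas, each of which sums over every value x in the model. The function names are made up for this example.

```python
from fractions import Fraction
from math import sqrt

# Probability model from the example: value x -> P(X = x)
model = {
    10_000: Fraction(1, 1000),
    5_000: Fraction(2, 1000),
    0: Fraction(997, 1000),
}

def expected_value(m):
    """Sum of x * P(X = x) over all values."""
    return sum(x * p for x, p in m.items())

def variance(m):
    """Sum of (x - mu)^2 * P(X = x) over all values."""
    mu = expected_value(m)
    return sum((x - mu) ** 2 * p for x, p in m.items())

mu = expected_value(model)
var = variance(model)
sd = sqrt(var)

print(mu, var, round(sd, 2))
```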

Shifts in Data:

E(X +/- c) = E(X) +/- c

VAR(X +/- c) = VAR(X)

E(aX) = aE(X)

VAR(aX) = a^2 VAR(X)

E(X +/- Y) = E(X) +/- E(Y)

VAR(X +/- Y) = VAR(X) + VAR(Y), X and Y must be independent

Is your situation a Bernoulli trial?

Bernoulli trials: 1) have only 2 possible outcomes

2) have a constant probability of success

3) have independent trials

Yes?

You can use two models!

In both models, q = 1 - p

Geometric- to find how long it will take to achieve a success, order does matter

mu = 1/p

SD = sqrt(q/p^2)

P(X=x) = q^(x-1) * p

Binomial- interested in the number of successes, order doesn't matter

mu = np

SD = sqrt(npq)

P(X=x) = (n!/(x!(n-x)!)) * p^x * q^(n-x)

TI Tips

2nd DISTR geometpdf (p,x) - ex. probability of finding a picture in the 5th box

geometcdf(p, x) - ex. probability of finding first success on or before the xth trial

binompdf(n,p,x) - ex. find probability of finding a card exactly twice in 5 boxes

binomcdf(n,p,x) - total probability of getting x or fewer successes among n trials
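
For checking calculator output, the four functions can be sketched in Python straight from the geometric and binomial formulas. The names mirror the TI functions for readability only; they are not a real Python library, and p = 0.2 is a hypothetical value.

```python
from math import comb

def geometpdf(p, x):
    """P(first success occurs on trial x)."""
    return (1 - p) ** (x - 1) * p

def geometcdf(p, x):
    """P(first success occurs on or before trial x)."""
    return 1 - (1 - p) ** x

def binompdf(n, p, x):
    """P(exactly x successes in n trials)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(x or fewer successes in n trials)."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

# Hypothetical success probability p = 0.2
print(geometpdf(0.2, 5))    # first success in the 5th box
print(binompdf(5, 0.2, 2))  # exactly 2 successes in 5 boxes
```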

A binomial model is approx. Normal if np >= 10 and nq >= 10

**From the hypothetical to the real**

Sampling Distribution Models

Proportions:

center at p

SD(p-hat) = sqrt(pq/n)

Conditions:

1) 10% condition

2) Success/failure condition

Central Limit Theorem: As the sample size n increases, the mean of n independent values has a sampling distribution that tends toward a Normal model with mean mu(y-bar) = mu, the population mean, and SD(y-bar) = sigma/sqrt(n), where sigma is the population standard deviation

Note 3 conditions for CLT: 1) Random sampling condition

2) Independence assumption

3) 10% condition
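
A small simulation can make the Central Limit Theorem concrete. The population (exponential with mean 1 and SD 1), the sample size, and the seed are all hypothetical choices for this sketch.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)  # seeded so the run can be repeated
n = 30                  # size of each sample
reps = 2000             # number of samples drawn

# Hypothetical skewed population: exponential with mean 1 and SD 1.
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)
]

center = mean(sample_means)   # should sit near the population mean, 1
spread = stdev(sample_means)  # should sit near sigma / sqrt(n)
predicted = 1 / sqrt(n)

print(round(center, 3), round(spread, 3), round(predicted, 3))
```

Even though the population is strongly skewed, the sample means cluster around the population mean with spread close to sigma/sqrt(n).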

Also note: when the standard deviation of a sampling distribution is estimated from the sample data, it is called the standard error

**How do we determine a range for the real proportion, and how confident are we that the real proportion is within this interval?**

**Confidence Interval = p-hat +/- ME**

ME = z* x SE(p hat)

**Must meet conditions:**

1) Independence

2) Random

3) 10% condition

4) Success/failure condition

**In the calculator, you can use 1-PropZInt to determine the CI : STAT -> Tests-> A**
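
For comparison with the calculator, here is a sketch of the same interval in Python, using SE(p-hat) = sqrt(p-hat * q-hat / n). The function name and the sample counts (520 successes out of 1000) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Confidence interval for a proportion: p-hat +/- z* x SE(p-hat)."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    me = z_star * se
    return p_hat - me, p_hat + me

# Hypothetical sample, for illustration only.
low, high = one_prop_z_interval(successes=520, n=1000)
print(round(low, 3), round(high, 3))
```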

**Using these to support hypotheses**

**Ho (AKA null hypothesis) - proportion of ___ has not changed**

Ha (AKA alternate hypothesis) - proportion of __ =/= or > or < original p

SD(p-hat) = sqrt(p0*q0/n), using the null hypothesis value p0 and q0 = 1 - p0

Use one-proportion z-test

z = (p-hat - p0) / SD(p-hat)
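
A sketch of the one-proportion z-test in Python, with the standard deviation computed from the null hypothesis value p0. The function name, the `alternative` labels, and the sample counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, P-value) for a one-proportion z-test."""
    p_hat = successes / n
    sd = sqrt(p0 * (1 - p0) / n)  # SD of p-hat if the null hypothesis is true
    z = (p_hat - p0) / sd
    std = NormalDist()
    if alternative == "greater":
        p_value = 1 - std.cdf(z)
    elif alternative == "less":
        p_value = std.cdf(z)
    else:
        p_value = 2 * (1 - std.cdf(abs(z)))
    return z, p_value

# Hypothetical sample: 560 successes in 1000 trials, testing Ho: p = 0.5.
z, p_value = one_prop_z_test(successes=560, n=1000, p0=0.5, alternative="greater")
print(round(z, 2), p_value)
```

A P-value below the chosen significance level leads to rejecting Ho.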

**4 conditions**

1) Independence

2) Random sampling

3) 10% condition

4) Success/failure condition

np0 and nq0 both at least 10

**significance level generally equals 0.05 or 0.01**

**Errors:**

Type I Error - Ho is true but you rejected it

Type II Error - Ho is false but you retain it

**Power: the probability of correctly rejecting a false null hypothesis**

**Effect size- distance between the null hypothesis value p0 and the true value p**

**Reduce Type I and Type II errors by increasing sample size**

**Generally, for a fixed sample size, reducing Type I errors will increase Type II errors**

**Part 3**