Looking at Data

Gathering Data

Using Data

Exploring Data

Describing patterns and departures from patterns

(I.A) Constructing and Interpreting graphical displays of distributions of univariate data

Dotplot, stemplot, histogram, cumulative frequency plot

Center/Shape/Spread

Clusters/gaps/Outliers

Anticipating Patterns

Exploring random phenomena using probability and simulation

(III.C) The Normal Distribution

Statistical Inference

Estimating population parameters and testing hypotheses

**AP Statistics Topic Outline**

(I.B) Summarizing distributions of univariate data

Measuring center (mean, median), spread (IQR, range, sd)

Measuring position (quartiles, percentiles, standardized scores)

Boxplots

Effect of changing units

(I.C) Comparing distributions of univariate data

Dotplots, back-to-back stemplots, parallel boxplots

Comparing shape, center and spread

Comparing clusters, gaps, outliers

Sampling and Experimentation

Planning and conducting a study

(I.D) Exploring bivariate data

Analyzing patterns in scatterplots

Correlation and linearity

Least-squares regression line

Residual plots, outliers, influential points

Transformations to achieve linearity: logarithmic and power transformations

(I.E) Exploring categorical data

Frequency tables and bar charts

Marginal and joint frequencies (two-way tables)

Conditional relative frequencies and association

Comparing distributions using bar charts

(II.A) Overview of methods for data collection

Census

Sample Survey

Experiment

Observational Study

(II.B) Planning and conducting surveys

Characteristics of good surveys

Populations, samples, and random selection

Sources of bias

SRS, Stratified, Cluster

(II.C) Planning and conducting experiments

Characteristics of good experiments

Treatments, control groups, experimental units, random assignment, and replication

Sources of bias and confounding, including placebo effect and blinding

Completely randomized design

Randomized block design, including matched pairs design

(II.D) Generalizability

Results and types of conclusions

from observational studies

from experiments

from surveys

(III.C.1) Theory

Shape, center and spread

Empirical rule

Finding probabilities from standardized scores using tables

Finding probabilities from standardized scores using calculators

(III.C.2) Applications

Model for measurements

Argue whether a sample came from a Normal population

(III.A) Probability

Interpreting

"Law of large numbers"

AND / OR

Conditional probability

Independence

Simulation

Random Variables

Expected value, Standard Deviation

Linear combinations

Probability Distributions

Binomial

Geometric

(III.B) Combining Random Variables

Independence vs. Dependence

Mean, SD for sums and differences of independent RV

(III.D) Sampling Distributions

Proportion, Mean

Central Limit Theorem

Difference in proportions

Difference in means

t-distribution

chi-square distribution

(IV.A) Estimation

Population parameters

Margins of error

Logic of confidence intervals

CI for proportion, difference in proportion

CI for mean, difference in mean (paired and unpaired)

CI for slope of least-squares regression line

(IV.B) Tests of Significance

Logic of significance testing

Null and alternative hypotheses

p-values

one-sided vs. two-sided

Type I and Type II errors

Power

Test for proportion, difference in proportion

Test for mean, difference in mean (paired and unpaired)

Chi-square tests

Homogeneity

Goodness of fit

Independence

Test for slope of best-fit line

The AP Exam

Section I

40 multiple choice {90 min}

~~10 minute break~~

Section II

6 Free Response {90 min}

q1-q5: ~12 min each

q6: ~30 min

q6 is an "Investigative task" -- Integrate topics and apply them to new contexts or in a non-routine way

Wednesday, May 13 at 12:00

Mean of Sum / Difference

Probability of A and B

When two events are independent, the probability of both occurring is the product of the probabilities of the individual events. More formally, if events A and B are independent, then the probability of both A and B occurring is:

P(A and B) = P(A) x P(B)

If you flip a coin twice, what is the probability that it will come up heads both times? Event A is that the coin comes up heads on the first flip and Event B is that the coin comes up heads on the second flip. Since both P(A) and P(B) equal 1/2, the probability that both events occur is:

1/2 x 1/2 = 1/4.

http://onlinestatbook.com/2/probability/basic.html
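The multiplication rule can also be checked by simulation. A minimal sketch (the helper `simulate_two_heads` is made up for illustration, not from the text):

```python
import random

def simulate_two_heads(trials=100_000, seed=1):
    """Estimate P(heads on both of two fair coin flips) by simulation."""
    random.seed(seed)
    hits = sum(
        1 for _ in range(trials)
        if random.random() < 0.5 and random.random() < 0.5
    )
    return hits / trials

exact = 0.5 * 0.5            # multiplication rule: P(A and B) = P(A) x P(B)
estimate = simulate_two_heads()
```

With 100,000 trials the simulated estimate should land close to the exact value of 1/4.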

Variance of Sum / Difference
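The notes leave this heading blank; the key fact (cf. "Mean, SD for sums and differences of independent RV" in the outline) is that for independent X and Y, Var(X + Y) = Var(X - Y) = Var(X) + Var(Y). A quick simulated check, using fair die rolls as a stand-in example:

```python
import random
import statistics

# For independent X and Y, Var(X + Y) and Var(X - Y) should both be
# approximately Var(X) + Var(Y). Here X and Y are fair die rolls.
random.seed(2)
xs = [random.randint(1, 6) for _ in range(50_000)]
ys = [random.randint(1, 6) for _ in range(50_000)]

var_x, var_y = statistics.pvariance(xs), statistics.pvariance(ys)
var_sum = statistics.pvariance([x + y for x, y in zip(xs, ys)])
var_diff = statistics.pvariance([x - y for x, y in zip(xs, ys)])
```

Note that the variances *add* even for a difference; it is the means that subtract.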

PROBABILITY

A cluster is formed when several data points lie in a small interval. A gap is an interval that contains no data. An outlier has a value that is much greater than or much less than other data in the set. An outlier may significantly affect the mean of a data set. A single outlier will not affect the mode(s) and is likely to affect the median only slightly. Features such as clusters, gaps, and outliers are more easily seen when the data are shown on a line plot.

Data: A set of measurements or observations taken on a group of objects.

Variable: A characteristic of an object.

Two types of data:

• Quantitative variables

– Weight, family income, number of cups of coffee on a given day.

• Categorical variables

– Gender, college major, satisfaction response on a survey (poor, fair, good, excellent)

Reading Box Plots

The conditional probability of an event B is the probability that the event will occur given the knowledge that an event A has already occurred. This probability is written P(B|A), notation for the probability of B given A. In the case where events A and B are independent (where event A has no effect on the probability of event B), the conditional probability of event B given event A is simply the probability of event B, that is P(B).

If events A and B are not independent, then the probability of the intersection of A and B (the probability that both events occur) is defined by

P(A and B) = P(A)P(B|A).

From this definition, the conditional probability P(B|A) is easily obtained by dividing by P(A)
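A worked example of the dependent-events formula (the card-drawing setup is a standard illustration, not from the text above):

```python
# Drawing two cards without replacement.
# A = first card is an ace, B = second card is an ace. A and B are dependent.
p_a = 4 / 52                    # P(A)
p_b_given_a = 3 / 51            # P(B | A): one ace is already gone
p_a_and_b = p_a * p_b_given_a   # P(A and B) = P(A) P(B | A)

# Dividing by P(A) recovers the conditional probability, as in the text:
recovered = p_a_and_b / p_a     # P(B | A) = P(A and B) / P(A)
```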

A histogram is a graphical representation of the distribution of numerical data using bars of different heights.

Bivariate data. When we conduct a study that examines the relationship between two variables, we are working with bivariate data. Suppose we conducted a study to see if there were a relationship between the height and weight of high school students. Since we are working with two variables (height and weight), we would be working with bivariate data.


Back-to-back stemplot example


Central Limit theorem

If there is an outlier, it is more accurate to use the median rather than the mean to describe the center, because the median is resistant to outliers.

https://www.khanacademy.org/math/probability/statistics-inferential/sampling_distribution/v/central-limit-theorem
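The Central Limit Theorem is easy to see by simulation: take many samples from a decidedly non-normal population (uniform on 0 to 1 here, a made-up example) and look at the distribution of the sample means. A sketch:

```python
import random
import statistics

def sample_means(n=30, trials=2_000, seed=3):
    """Means of many samples of size n drawn from a uniform(0, 1) population."""
    random.seed(seed)
    return [statistics.fmean(random.random() for _ in range(n))
            for _ in range(trials)]

means = sample_means()
# CLT: the distribution of `means` is roughly normal, centered at the
# population mean 0.5, with sd about (1 / sqrt(12)) / sqrt(30), roughly 0.053
```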

Example of a plot for which a linear regression is inadequate.

Interquartile Range (IQR)


Simple Random Sampling — every possible sample of the same size is equally likely to be chosen

Systematic sampling is often used instead of random sampling. It is also called an Nth name selection technique. After the required sample size has been calculated, every Nth record is selected from a list of population members. As long as the list does not contain any hidden order, this sampling method is as good as the random sampling method. Its only advantage over the random sampling technique is simplicity. Systematic sampling is frequently used to select a specified number of records from a computer file.
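The every-Nth selection described above can be sketched in a few lines (the helper and the 100-record list are hypothetical):

```python
import random

def systematic_sample(population, k, seed=4):
    """Every-Nth selection: random start in the first interval, then every Nth record."""
    random.seed(seed)
    interval = len(population) // k     # the "N" in every-Nth
    start = random.randrange(interval)  # random starting point
    return population[start::interval][:k]

records = list(range(1, 101))                # hypothetical list of 100 members
sample = systematic_sample(records, k=10)    # interval N = 10
```

Note the caveat in the text: this is only as good as random sampling if the list has no hidden order.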

Stratified Random Sampling

A method of sampling that involves the division of a population into smaller groups known as strata. In stratified random sampling, the strata are formed based on members' shared attributes or characteristics. A random sample from each stratum is taken in a number proportional to the stratum's size when compared to the population. These subsets of the strata are then pooled to form a random sample.
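Proportional stratified sampling can be sketched as follows (the `stratified_sample` helper and the juniors/seniors strata are made up for illustration):

```python
import random

def stratified_sample(strata, total_n, seed=5):
    """Proportional stratified random sampling.

    `strata` maps stratum name -> list of members; each stratum contributes
    a simple random sample proportional to its share of the population.
    """
    random.seed(seed)
    pop_size = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        k = round(total_n * len(members) / pop_size)
        sample.extend(random.sample(members, k))
    return sample

# Hypothetical population of 100 students split by grade level:
strata = {"juniors": list(range(60)), "seniors": list(range(60, 100))}
picked = stratified_sample(strata, total_n=10)   # 6 juniors + 4 seniors
```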

"Univariate" means one type of data (a single variable).

1. Linearity refers to whether a data pattern is linear (straight) or nonlinear (curved).

2. Slope refers to the direction of change in variable Y when variable X gets bigger. If variable Y also gets bigger, the slope is positive; but if variable Y gets smaller, the slope is negative.

3. Strength refers to the degree of "scatter" in the plot. If the dots are widely spread, the relationship between variables is weak. If the dots are concentrated around a line, the relationship is strong.

Measures of spread describe how similar or varied the set of observed values are for a particular variable.

There are many reasons why the measure of the spread of data values is important, but one of the main reasons regards its relationship with measures of central tendency. A measure of spread gives us an idea of how well the mean, for example, represents the data.

Some ways we can measure spread are the range, the interquartile range (IQR), the variance, and the standard deviation. (The mean and median are measures of center, not spread.)
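The common spread measures can be computed directly with Python's `statistics` module (the data values below are made up):

```python
import statistics

data = [4, 8, 15, 16, 23, 42]    # hypothetical sample

q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
spread = {
    "range": max(data) - min(data),
    "IQR": q3 - q1,
    "sd": statistics.stdev(data),             # sample standard deviation
}
```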

Linear transformation. A linear transformation preserves linear relationships between variables. Therefore, the correlation between x and y would be unchanged after a linear transformation. Examples of a linear transformation to variable x would be multiplying x by a constant, dividing x by a constant, or adding a constant to x.

Nonlinear transformation. A nonlinear transformation changes (increases or decreases) linear relationships between variables and, thus, changes the correlation between variables. Examples of a nonlinear transformation of variable x would be taking the square root of x or the reciprocal of x.
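Both claims are easy to check numerically. A sketch (the `pearson_r` helper and the toy data are made up for illustration):

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    n = len(xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

r_original = pearson_r(x, y)
r_linear = pearson_r([3 * xi + 7 for xi in x], y)   # linear: r unchanged
r_sqrt = pearson_r([xi ** 0.5 for xi in x], y)      # nonlinear: r changes
```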

A good experimental design serves three purposes.

Causation. It allows the experimenter to make causal inferences about the relationship between independent variables and a dependent variable.

Control. It allows the experimenter to rule out alternative explanations due to the confounding effects of extraneous variables (i.e., variables other than the independent variables).

Variability. It reduces variability within treatment conditions, which makes it easier to detect differences in treatment outcomes.

Two events are mutually exclusive or disjoint if they cannot occur at the same time.

The complement of an event is the event not occurring. The probability that Event A will not occur is denoted by P(A').

The mean of the discrete random variable X is also called the expected value of X. Notationally, the expected value of X is denoted by E(X). Use the following formula to compute the mean of a discrete random variable.

E(X) = μx = Σ [ xi * P(xi) ]
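Applying the formula to a fair six-sided die (a standard textbook example, not from the text above):

```python
# E(X) for a fair die: each outcome 1..6 has probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6

mean = sum(x * p for x in outcomes)                     # E(X) = 3.5
variance = sum((x - mean) ** 2 * p for x in outcomes)   # Var(X) = 35/12
sd = variance ** 0.5
```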

Discrete. Within a range of numbers, discrete variables can take on only certain values. Suppose, for example, that we flip a coin and count the number of heads. The number of heads will be a value between zero and plus infinity. Within that range, though, the number of heads can be only certain values. For example, the number of heads can only be a whole number, not a fraction. Therefore, the number of heads is a discrete variable. And because the number of heads results from a random process - flipping a coin - it is a discrete random variable.

Univariate means single variable.

Univariate data doesn't look at causes or relationship.
