**Statistics Vocabulary**

**Discrete and Continuous Variables**

Discrete: A variable that takes a countable number of distinct values, usually whole numbers (counts).

Continuous: A variable that can take any value in an interval, including the infinitely many values between whole numbers. Associated with physical measurement. Real number values.

Conditions for a Valid Probability Model

Probability of an event is always between 0 and 1

All possible outcomes of a phenomenon must have a combined probability of 1

The probability of an event not occurring is 1 minus the probability that the event occurs. P(not A)=1-P(A)
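As a minimal sketch, the three conditions above can be checked in Python. The function names `is_valid_model` and `complement`, and the fair-die example, are illustrative assumptions rather than part of the original notes.

```python
from fractions import Fraction


def is_valid_model(probabilities):
    """Check that every probability is in [0, 1] and that they sum to 1."""
    each_in_range = all(0 <= p <= 1 for p in probabilities.values())
    total_is_one = sum(probabilities.values()) == 1
    return each_in_range and total_is_one


def complement(p_event):
    """P(not A) = 1 - P(A)."""
    return 1 - p_event


# Hypothetical model: a fair six-sided die, one outcome per face.
die = {face: Fraction(1, 6) for face in range(1, 7)}

print(is_valid_model(die))   # True
print(complement(die[6]))    # 5/6
```

Exact fractions are used here so that the "sums to 1" check is not disturbed by floating-point rounding.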

Conditional Probability

The conditional probability of an event B is the probability that the event will occur given the knowledge that an event A has already occurred. This probability is written P(B|A), notation for the probability of B given A.

Probability Model

A mathematical representation of a random phenomenon. It is defined by its sample space, events within the sample space, and probabilities associated with each event.

Scatter Plot

A graph that shows the relationship between two quantitative variables (bivariate data) measured on the same individuals.

Observational Study

A study that observes individuals and measures variables of interest without doing anything intentional to the individuals involved.

Standard Deviation

A measure of how spread out the numbers are. It is represented by the symbol sigma (σ) and is calculated by taking the square root of the variance.

Categorical Variable

Places an individual into one of several groups or categories

Categories that are not numerical such as Race and Gender.

Measures of Average

VS

Quantitative Variable

Takes numerical values for which arithmetic operations (adding and averaging) make sense.

Variables measured by numerical values such as Age and Blood Pressure.

Distribution of a Variable

The pattern of variation of a variable; tells us what values the variable takes and how often it takes these values.

An example of a distribution is the number of M&Ms of each color in a bag.

Mean of Data

The average of a set of data/variables.

In other words it is the sum divided by the count.

Example:

(2 + 3 + 5 + 2) / 4 = 12 / 4 = 3

Median of Data

The middle value of the data set when it has been placed in ascending order. With an even number of values, it is the mean of the two middle values.

Example:

(2,2,4,6,18)

4 is the Median

Variance

The average of the squared differences from the mean.

Calculate the mean of a set of data, then calculate each value's difference from the mean. To find the variance, take each difference, square it, then average the results.
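The definitions of mean, median, variance, and standard deviation can be sketched in a few lines of Python. The function names are illustrative, and the variance here divides by the count, as described above; many textbooks divide by the count minus one when the data are a sample.

```python
import math


def mean(values):
    """Sum divided by the count."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (mean of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def variance(values):
    """Average of the squared differences from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


print(mean([2, 3, 5, 2]))            # 3.0
print(median([2, 2, 4, 6, 18]))      # 4
print(variance([2, 3, 5, 2]))        # 1.5
print(standard_deviation([2, 3, 5, 2]))
```

The two data sets are the ones used in the mean and median examples above.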

Example of an observational study: A survey, or prolonged observation of students in a school environment.

Vs

Experiment

Differs from an observational study in that it intentionally does something (applies a treatment) to the individuals involved.

Example: One takes a sample of students and gives one group caffeine before a test, while the control group receives none.

Population

The entire group of people, animals, or things from which data could be drawn.

Sample

A portion of a population, taken to survey for a specific quality.

Example: The population is the students of Sherwood High School (2,000).

Example: The sample is the students picked out of a hat to observe study habits (20).

Ogive

A cumulative frequency graph: a curve showing the cumulative frequency for a given set of data.

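| Age (years) | Frequency |
| --- | --- |
| 10 | 5 |
| 11 | 10 |
| 12 | 27 |
| 13 | 18 |
| 14 | 6 |
| 15 | 16 |
| 16 | 38 |
| 17 | 9 |

| Age (years) | Frequency | Cumulative Frequency |
| --- | --- | --- |
| 10 | 5 | 5 |
| 11 | 10 | 5 + 10 = 15 |
| 12 | 27 | 15 + 27 = 42 |
| 13 | 18 | 42 + 18 = 60 |
| 14 | 6 | 60 + 6 = 66 |
| 15 | 16 | 66 + 16 = 82 |
| 16 | 38 | 82 + 38 = 120 |
| 17 | 9 | 120 + 9 = 129 |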
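The running totals plotted on an ogive can be computed directly from the frequency column. This is a small sketch using the age data above; the variable names are illustrative.

```python
from itertools import accumulate

ages = [10, 11, 12, 13, 14, 15, 16, 17]
frequencies = [5, 10, 27, 18, 6, 16, 38, 9]

# Each cumulative frequency is the sum of all frequencies up to that age.
cumulative = list(accumulate(frequencies))

for age, freq, total in zip(ages, frequencies, cumulative):
    print(age, freq, total)

print(cumulative)  # [5, 15, 42, 60, 66, 82, 120, 129]
```

Plotting `cumulative` against `ages` and joining the points gives the ogive.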

Histogram

Breaks the range of values of a variable into classes and displays only the count or percent of observations that fall into each class.

First create a frequency table, in this case, of the heights of black cherry trees.

60-64 feet - III

65-69 feet - III

70-74 feet - IIIIIIII

75-79 feet - IIIIIIIIII

80-84 feet - IIIII

85-89 feet - II
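A tally table like this one can be built by assigning each observation to its class and counting. The sketch below uses a short list of made-up heights (not the black cherry tree data) and a hypothetical helper `class_label`, and assumes whole-number heights and classes 5 feet wide starting at 60.

```python
from collections import Counter


def class_label(value, start=60, width=5):
    """Return the class (e.g. '70-74') that a whole-number value falls into."""
    lower = start + ((value - start) // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical heights in feet, for illustration only.
heights = [63, 65, 66, 70, 72, 72, 75, 76, 79, 80, 81, 87]

counts = Counter(class_label(h) for h in heights)

for label in sorted(counts):
    print(f"{label} feet - {'I' * counts[label]}")
```

A histogram then draws one bar per class with height equal to the count.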

Stem and Leaf Plot

A method for showing the frequency with which certain classes of values occur.

To organize the values, the leading digit or digits go to the left of the bracket (the stem) and the final digit goes to the right (the leaf). For example, 23 and 27 share the stem 2 and have the leaves 3 and 7.

Box and Whisker Plot

A way of displaying numerical data and spread using quartiles and median.

Quartiles

Values that divide numerical data into quarters.

1. Put the numbers in order

2. Cut the list into four equal parts

Inter-Quartile Range

The lower quartile subtracted from the upper quartile.

The range of the "box" in a box and whisker plot
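Quartiles and the inter-quartile range can be sketched as follows. This uses one common textbook convention (the quartiles are the medians of the lower and upper halves, leaving out the overall median when the count is odd); other conventions give slightly different values. The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (lower quartile, median, upper quartile)."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2:]
    return median(lower_half), median(ordered), median(upper_half)


q1, q2, q3 = quartiles([2, 2, 4, 6, 18])
iqr = q3 - q1

print(q1, q2, q3)  # 2.0 4 12.0
print(iqr)         # 10.0
```

The box in a box and whisker plot runs from `q1` to `q3`, with a line at `q2`.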

Bar Graph

A graphical display of data using bars of different heights.

Density Curve

A graphical picture of the population distribution of a variable.

Properties

Always on or above the x-axis

Has an area of exactly 1 under the curve

The area under the curve over an interval gives the proportion of observations that fall in that interval.

It is an idealized description, so it cannot describe individual observations or outliers

Median of a Density Curve

The "equal areas" point that divides the area under the density curve in half.

Mean of a Density Curve

The "balance point" at which the curve would balance if made of solid material.

Normal Distribution

Distributions described by normal curves (a kind of density curve that is symmetric, single peaked, and bell shaped).

Example: Heights of people, errors in measurement, standardized test scores, blood pressure.

Empirical Rule

Normal curves follow the Empirical Rule

68% of values are within one standard deviation of the mean

95% of values are within two standard deviations of the mean

99.7% of values are within three standard deviations of the mean
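The three percentages can be checked numerically with the standard library's `statistics.NormalDist` (Python 3.8 or later). This is only a quick verification sketch; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Normal curve with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of values within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The results are approximately 0.68, 0.95, and 0.997, which is why the rule is often called the 68-95-99.7 rule.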

Standard Normal Distribution

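The normal distribution with mean 0 and standard deviation 1. It is what results when every value in normally distributed data has been standardized and the z-scores are plotted.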

Z Score/Standardized Score

The number of standard deviations an observation lies from the mean, which tells us how close to or far from the mean the observation is.

This is called the Z Score or standardized score.

To find the Z Score, take an observation, subtract the mean, and divide by the standard deviation: z = (observation - mean) / standard deviation.
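As a short sketch, the standardization step is a one-line function. The test-score numbers below (mean 100, standard deviation 15) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / standard_deviation


# Hypothetical scores with mean 100 and standard deviation 15.
print(z_score(130, 100, 15))  # 2.0
print(z_score(85, 100, 15))   # -1.0
```

A positive z-score means the observation is above the mean and a negative one means it is below.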

Table A

Normal Probability Plot

A graphical technique for assessing whether or not a data set is approximately normally distributed.

The plot compares the data with what would be expected of data that are perfectly normally distributed.

If the plot is roughly linear, the data are approximately normal.

[Figure panels: Normally Distributed; Skewed Left]

Direction/Association

Positive: Above average y-values tend to occur with above average x-values, and below average values tend to occur together.

Negative: Above average y-values tend to occur with below average x-values and vice versa.

No Direction/Association: The points are scattered with no overall upward or downward trend.

Form

Linear Appearance-straight

Non-Linear- Curved

Groups or Clusters

Strength

How close the points on a scatter plot are to a line of best fit.

Outliers fall outside the overall pattern and weaken the relationship.

Coefficient of Correlation

A quantitative measure of the strength and direction of a linear association. (r)

r is always between -1 and 1. The sign shows whether the correlation is positive or negative, and the further r is from 0, the stronger the correlation.

Coefficient of Determination

Gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable.

A measure of how well the regression line represents the data.

r^2

If r = 0.922, then r^2 = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y.
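The correlation coefficient and the coefficient of determination can be computed from the definitions. The sketch below uses a small made-up data set and an illustrative function name `correlation`; it is one standard way to compute r, not necessarily the method used in the course.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 4))       # 0.7746
print(round(r ** 2, 4))  # 0.6

# The worked example above: r = 0.922 gives r^2 of about 0.850.
print(round(0.922 ** 2, 3))  # 0.85
```

Here about 60% of the variation in y is explained by the linear relationship with x.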

Linear Regression

Attempts to model the relationship between two variables by fitting a linear equation to observed data. The fitted line is sometimes called the line of best fit.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).

Least Squares Regression Line

Calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (the residuals).

Residuals

The vertical distance between a data point and the graph of a regression equation.

The residual is positive if the data point is above the graph.

The residual is negative if the data point is below the graph.

The residual is 0 only when the graph passes through the data point.
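A least squares line and its residuals can be sketched with the usual closed-form formulas: the slope is the sum of the products of the x and y deviations divided by the sum of the squared x deviations, and the intercept makes the line pass through the point of means. The data and the function name `least_squares_line` are illustrative assumptions.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)

# Residual = observed y minus predicted y; positive when the point is above the line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(round(a, 2), round(b, 2))          # 2.2 0.6
print([round(e, 2) for e in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

The residuals of a least squares line always sum to zero (up to rounding), which is a handy check on the arithmetic.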

Predicted value of y: ŷ = a + bx

Residual Plot

A plot of the residuals (the differences between the observed y-values and the values predicted by the LSRL) against their corresponding x-values.

If the plot shows no obvious pattern, then a line is a reasonable model for the data.

Outliers

A data point that is visibly distant from the rest of the data, in either the positive or negative x or y direction.

Influential Points

An outlier in the x direction (having an extreme x value) is often an influential point: removing it would noticeably change the regression line.

Extrapolation

The use of a regression line for prediction outside the range of values of the explanatory variable x used to obtain the line.

Predicting in this way is rarely accurate.

Example: Suppose y, the weight of a rat in grams, increases by 40 for every increase of one week in x. Once 30 weeks have passed, the predicted weight of the rat is unreasonable.

Logarithmic Transformation of Exponential Data

Exponential data can be made linear by taking the logarithm of the y-values:

1. (Xi, Yi) -> (Xi, log Yi)

2. (Xi, Yi) -> (Xi, ln Yi)

If 2^x = y, then x = log base 2 of y.
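A quick way to see why the transformation works: for data that grow exponentially, the logarithms of the y-values rise by the same amount for each unit step in x, which is exactly what a straight line does. The data below (y = 3 times 2 to the power x) are made up for illustration.

```python
import math

# Hypothetical exponential data: y = 3 * 2^x.
xs = [0, 1, 2, 3, 4]
ys = [3 * 2 ** x for x in xs]

# Transform (x, y) -> (x, log y).
log_ys = [math.log10(y) for y in ys]

# Change in log y for each one-unit step in x.
steps = [log_ys[i + 1] - log_ys[i] for i in range(len(log_ys) - 1)]

print(ys)                             # [3, 6, 12, 24, 48]
print([round(s, 4) for s in steps])   # [0.301, 0.301, 0.301, 0.301]
print(math.log2(8))                   # 3.0
```

Every step equals log10(2), so the transformed points lie on a line with that slope. The last line illustrates the base 2 statement: 2^3 = 8, so the log base 2 of 8 is 3.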

Lurking Variable

A variable that is not among the explanatory or response variables in the study, yet may influence the interpretation of relationships among the variables.

Example: A study attempts to correlate test scores with shoe size. It is not the shoe size that affects the results, but rather the fact that older students have bigger shoes and tend to score higher than younger students.

Simpson's Paradox

An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group.

Example: In 1973, the University of California-Berkeley was sued for sex discrimination. The numbers looked pretty incriminating: the graduate schools had just accepted 44% of male applicants but only 35% of female applicants. But when the data are broken down by department, the apparent bias disappears.

Conditional Distribution

In a two way table, the conditional distribution can be found by calculating the percent of each entry in the column using the column total.

Baseball Players

| | Elite | Non Elite | Did not Play |
| --- | --- | --- | --- |
| Arthritis | 10 | 9 | 24 |
| No Arthritis | 61 | 206 | 548 |

Percent with arthritis in each column:

Elite: 10/71 = 14%

Non Elite: 9/215 = 4%

Did not Play: 24/572 = 4%
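The column percentages can be recomputed from the counts. This sketch assumes the six counts pair up as (arthritis, no arthritis) = (10, 61) for Elite, (9, 206) for Non Elite, and (24, 548) for Did not Play, which is the reading that reproduces the 14% and 4% figures; the function name is illustrative.

```python
# Counts per column of the two way table (assumed cell assignment, see above).
counts = {
    "Elite": {"Arthritis": 10, "No Arthritis": 61},
    "Non Elite": {"Arthritis": 9, "No Arthritis": 206},
    "Did not Play": {"Arthritis": 24, "No Arthritis": 548},
}


def conditional_distribution(column):
    """Percent of each entry in a column, using the column total."""
    total = sum(column.values())
    return {row: 100 * count / total for row, count in column.items()}


arthritis_percent = {
    group: round(conditional_distribution(column)["Arthritis"], 1)
    for group, column in counts.items()
}

print(arthritis_percent)
# {'Elite': 14.1, 'Non Elite': 4.2, 'Did not Play': 4.2}
```

With these counts, the Did not Play column works out to 24/572, or about 4.2%.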

Common Response

Example: Z (the type of a plant) influences both X (how much it needs to be watered) and Y (how much sunlight it needs), so X and Y appear related.

Confounding

Confounding Variable: How much you ate before swimming

Explanatory Variable: Swimming skill

Response Variable: Chance of drowning

Factors

A factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter.

A factor is a general type or category of treatments

Levels

Different treatments constitute different levels of a factor.

For example, three different groups of runners are subjected to different training methods. The runners are the experimental units and the training methods are the treatments; the three training methods constitute three levels of the factor 'type of training'.

Simple Random Sample

Stratified Random Sample

Cluster Sample

Systematic Random Sample

Voluntary Response Sample

Convenience Sample

Statistically Significant

Simple Random Sample: Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample.

Example: Table B, pulling names out of a hat.

Stratified Random Sample: First dividing the population into subgroups (strata), then taking a sample from each stratum.

Example: You could first divide the population of Sherwood High School into subgroups by grade, then select from those subgroups.

Cluster Sample: A sampling technique where the entire population is divided into groups, or clusters, and a random sample of these clusters is selected. All observations in the selected clusters are included in the sample.

Example: The Department of Agriculture wishes to investigate the use of pesticides by farmers in England. A cluster sample could be taken by identifying the different counties in England as clusters. A sample of these counties (clusters) would then be chosen at random, so all farmers in those counties selected would be included in the sample.

Voluntary Response Sample: When the sample for a study is gathered from volunteers who choose to participate.

It is biased in that those who choose to participate often represent more extreme opinions.

Example: Mailing a political survey where only those who wish to participate respond.

Convenience Sample: A sample that is selected because it is easiest for the researcher of the study.

Example: Selecting a sample of persons only in California for a study on American views because California residents are closer.
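The three random designs (simple random, stratified, and cluster) can be sketched with Python's `random` module. The student and county names, group sizes, and function names below are all made up for illustration; a fixed seed is used only so the run is repeatable.

```python
import random


def simple_random_sample(population, n, rng):
    """Choose n individuals entirely by chance, without replacement."""
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample from every stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random and keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


rng = random.Random(0)

# Hypothetical school of 2,000 students, 500 in each of four grades.
grades = {g: [f"grade{g}_student{i}" for i in range(500)] for g in (9, 10, 11, 12)}
students = [s for members in grades.values() for s in members]

srs = simple_random_sample(students, 20, rng)
stratified = stratified_sample(grades, 5, rng)

# Hypothetical counties, each listing its farmers.
counties = {f"county{c}": [f"county{c}_farmer{i}" for i in range(10)] for c in range(6)}
clustered = cluster_sample(counties, 2, rng)

print(len(srs), len(stratified), len(clustered))  # 20 20 20
```

Note the difference in what is randomized: individuals in the simple random sample, individuals within each grade in the stratified sample, and whole counties in the cluster sample.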

Response Bias

When respondents answer in the way they think the survey wants them to rather than according to their own beliefs.

Example: Answering in a way that makes the respondent seem healthier or smarter (e.g., How often do you exercise? Do you know what the Watergate scandal is?).

Non Response Bias

When the answers of those who responded differ from the answers that those who did not respond would have given.

Example: If only those who happen to exercise daily answered a survey on health habits, and those who do not exercise daily did not.

Under Coverage

When a specific group is left out of the sampling process.

Example: If Latinos were not included in a survey of opinions about immigration laws.

Placebo Effect

When a subject receives a fake treatment (a placebo) but is unaware of this and believes the treatment to be real. This sometimes causes a response, since the subject expects the treatment to have an effect.

Example: The Sugar Pill

Double Blind

When neither the subject receiving the treatment nor the person giving the treatment knows whether the treatment is a placebo.

Example: Subjects are receiving shots to improve energy. Half the shots are water; the other half are a new treatment. Neither the patient nor the doctor giving the shot knows whether it is the new treatment or just water.

Principles of Experimental Design

1. Understand the aims of the experiment

2. Identify the experimental units and treatments

3. Avoid Experimental Bias

4. Replicate

Block Design

The subjects are put into groups (blocks) of the same size as the number of treatments. The members of each block are then randomly assigned to different treatment groups

Example: In a study of skin disease, subjects are blocked based on the severity of their disease. The subjects within each block are then randomly assigned to treatments.

Matched Pair Design

Pairs subjects that are as similar as possible, then randomly assigns one member of each pair to the treatment while the other receives a placebo or acts as the control.

Sample Space

Example: All Americans

Event

Example: Americans who are Tea Party members

Independent Events

Addition Rule for Disjoint Events

Multiplication Rule for Dependent Events

Disjoint Events

Complement of an Event

Simulation

Steps of a Simulation

Rules of Mean

Rules of Variance

Expected Value

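Systematic Random Sample: A sample obtained by choosing a random starting point and then selecting individuals at a fixed interval, so that the chosen positions form an arithmetic sequence.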
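Sampling along an arithmetic sequence can be sketched as "pick a random start, then take every k-th individual". The list of 2,000 student names, the step of 100, and the function name are hypothetical.

```python
import random


def systematic_sample(population, step, rng):
    """Random start in the first `step` positions, then every `step`-th individual."""
    start = rng.randrange(step)
    return population[start::step]


rng = random.Random(0)  # fixed seed only so the run is repeatable

# Hypothetical ordered list of 2,000 students.
students = [f"student_{i}" for i in range(2000)]
sample = systematic_sample(students, 100, rng)

# Positions chosen: start, start + 100, start + 200, ...
positions = [int(name.split("_")[1]) for name in sample]
print(len(sample))
print(positions[:3])
```

Because 2,000 divides evenly by 100, every possible start yields exactly 20 students.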

Conditional Probability (example): The probability of a student receiving an "A" given that they study more than half an hour.

Independent Events: An event that is not affected by previous events.

Example: The probability of one student doing well on a test and the probability of another student doing well are independent.

Multiplication Rule for Dependent Events: If events A and B are not independent, then the probability of the intersection of A and B (the probability that both events occur) is given by

P(A and B) = P(A)P(B|A).

Addition Rule: Is only valid in this form when the events are mutually exclusive (disjoint).

P(A or B) = P(A) + P(B)

Disjoint Events: If two events are disjoint, then the probability of them both occurring at the same time is 0.

Disjoint: P(A and B) = 0

Complement of an Event: All the outcomes that are not the event.

Example: If the event is a coin landing on heads, the complement is the coin landing on tails.

The complement of A is written A^c.
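These rules can be checked on a small sample space. The sketch below assumes one roll of a fair six-sided die (so every outcome is equally likely) and uses illustrative event and function names.

```python
from fractions import Fraction

# Sample space for one roll of a fair die; all outcomes equally likely.
sample_space = frozenset(range(1, 7))


def prob(event):
    """P(event) = favourable outcomes / all outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


def conditional(b, a):
    """P(B|A) = P(A and B) / P(A)."""
    return prob(a & b) / prob(a)


even = frozenset({2, 4, 6})
one = frozenset({1})
greater_than_three = frozenset({4, 5, 6})

# Complement: P(not A) = 1 - P(A).
print(prob(sample_space - even), 1 - prob(even))           # 1/2 1/2

# Disjoint events: P(A and B) = 0, so P(A or B) = P(A) + P(B).
print(prob(even & one))                                    # 0
print(prob(even | one), prob(even) + prob(one))            # 2/3 2/3

# Multiplication rule: P(A and B) = P(A) * P(B|A).
print(prob(even & greater_than_three))                     # 1/3
print(prob(even) * conditional(greater_than_three, even))  # 1/3
```

Here "even" and "greater than three" are not independent: P(greater than three | even) is 2/3, while P(greater than three) is only 1/2.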

Simulation: The imitation of chance behavior, based on a model that accurately reflects the phenomenon.

Example assumptions:

1. Each outcome is equally likely

2. Each outcome is independent of the other

Steps of a Simulation:

1. State the problem

2. State the assumptions

3. Assign random digits

4. Simulate many repetitions

5. State your conclusion
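The steps can be followed in a short program. The problem chosen here is hypothetical: estimate the probability of getting at least 3 heads in 5 flips of a fair coin. The assumptions are that heads and tails are equally likely and that flips are independent; even digits stand for heads and odd digits for tails.

```python
import random


def simulate_once(rng, flips=5):
    """One repetition: draw one random digit per flip; even digit = heads."""
    digits = [rng.randrange(10) for _ in range(flips)]
    heads = sum(1 for d in digits if d % 2 == 0)
    return heads >= 3


def estimate_probability(repetitions, seed=0):
    """Simulate many repetitions and return the proportion of successes."""
    rng = random.Random(seed)
    successes = sum(simulate_once(rng) for _ in range(repetitions))
    return successes / repetitions


estimate = estimate_probability(10_000)
print(estimate)
```

The conclusion step is to report the estimate. For this problem the exact answer is 16/32 = 0.5, so the simulated proportion should land close to that.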

Expected Value: The possible values that a discrete variable takes are not always equally likely.

The mean of a discrete variable is the weighted average of its values, with each value weighted by its probability.

The mean of a discrete variable X is also called its expected value.

It gives a measure of the center of the distribution of the variable.

Rules of Mean: If X is a random variable and a and b are fixed numbers, then...

Rules of Variance: If X and Y are independent random variables, then...
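The expected value is a probability-weighted average, which the sketch below computes for a hypothetical variable X, the number of heads in two fair coin flips. The formulas for the rules of means and variances did not survive in these notes, so the checks use the versions commonly stated in textbooks, as an assumption: the mean of a + bX is a + b times the mean of X, the variance of a + bX is b squared times the variance of X, and for independent X and Y the means and the variances of X + Y both add. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def expected_value(dist):
    """Weighted average of the values, each weighted by its probability."""
    return sum(value * prob for value, prob in dist.items())


def variance(dist):
    """Probability-weighted average of squared differences from the mean."""
    mu = expected_value(dist)
    return sum(prob * (value - mu) ** 2 for value, prob in dist.items())


def linear_transform(dist, a, b):
    """Distribution of a + b*X (assumes b is not 0)."""
    return {a + b * value: prob for value, prob in dist.items()}


def sum_of_independent(dist_x, dist_y):
    """Distribution of X + Y when X and Y are independent."""
    total = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        total[x + y] = total.get(x + y, 0) + px * py
    return total


# Hypothetical X: number of heads in two fair coin flips.
heads = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

print(expected_value(heads), variance(heads))        # 1 1/2

shifted = linear_transform(heads, a=3, b=2)
print(expected_value(shifted), variance(shifted))    # 5 2

doubled = sum_of_independent(heads, heads)
print(expected_value(doubled), variance(doubled))    # 2 1
```

The outputs agree with the assumed rules: 3 + 2(1) = 5 and 2^2(1/2) = 2 for the transformed variable, and 1 + 1 = 2 and 1/2 + 1/2 = 1 for the sum.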

**by Nicolette Zillich**