# Statistics Vocabulary

by

## Nicolette Zillich

on 2 January 2014


#### Transcript of Statistics Vocabulary

Statistics Vocabulary
Discrete and Continuous Variables
Discrete: A variable that takes a countable set of separate values, typically whole numbers.
Continuous: A variable that can take any of infinitely many values in an interval, including values between whole numbers. Associated with physical measurement (real number values).
Conditions for a Valid Probability Model
Probability of an event is always between 0 and 1
All possible outcomes of a phenomenon must have a combined probability of 1
The probability of an event not occurring is 1 minus the probability that the event occurs. P(not A)=1-P(A)
Conditional Probability
The conditional probability of an event B is the probability that the event will occur given the knowledge that an event A has already occurred. This probability is written P(B|A), notation for the probability of B given A.
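A small sketch of this definition, assuming the standard formula P(B|A) = P(A and B) / P(A). The die example and the function name `conditional_probability` are illustrative and not part of the original slides.

```python
from fractions import Fraction


def conditional_probability(p_a_and_b, p_a):
    """Return P(B|A) = P(A and B) / P(A); only defined when P(A) > 0."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_a_and_b / p_a


# Hypothetical example: one roll of a fair six-sided die.
# A = "the roll is even" (2, 4, 6); B = "the roll is greater than 3" (4, 5, 6).
p_a = Fraction(3, 6)
p_a_and_b = Fraction(2, 6)  # outcomes 4 and 6 are in both A and B
p_b_given_a = conditional_probability(p_a_and_b, p_a)
print(p_b_given_a)  # 2/3
```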
Probability Model
A mathematical representation of a random phenomenon. It is defined by its sample space, events within the sample space, and probabilities associated with each event.
Scatter Plot
A graph that shows the relationship between two quantitative variables (bivariate data) measured on the same individuals.
Observational Study
A study that does not involve doing something intentional to the individuals involved; the researcher only observes.
Standard Deviation
A measure of how spread out the numbers are. It is represented by the symbol sigma (σ) and is calculated by taking the square root of the variance.
Categorical Variable
Places an individual into one of several groups or categories
Categories that are not numerical such as Race and Gender.
Measures of Average
VS
Quantitative Variable
Takes numerical values for which arithmetic operations (adding and averaging) make sense.
Variables measured by numerical values such as Age and Blood Pressure.
Distribution of a Variable
The pattern of variation of a variable; tells us what values the variable takes and how often it takes these values.
An example of distribution is the number of specific colors of M&Ms in a Bag.
Mean of Data
The average of a set of data values.
In other words, it is the sum divided by the count.
Example: (2 + 3 + 5 + 2) / 4 = 12 / 4 = 3
Median of Data
The middle value of the data set when it has been placed in ascending order.
Example:
(2,2,4,6,18)
4 is the Median
Variance
The average of the squared differences from the mean.
Calculate the mean of a set of data, then calculate each value's difference from the mean. To find the variance, square each difference, then average the results.
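The mean, median, and variance steps can be sketched in Python as follows. The data sets are the ones from the slide examples. The variance here divides by the count, as the slide describes (a population variance). Dividing by the count minus one instead gives the sample variance, which is an addition not covered in the slides.

```python
import math
import statistics

mean_data = [2, 3, 5, 2]          # data from the mean example
median_data = [2, 2, 4, 6, 18]    # data from the median example

mean = sum(mean_data) / len(mean_data)    # sum divided by the count
median = statistics.median(median_data)   # middle value after sorting

# Variance: square each difference from the mean, then average the results.
squared_diffs = [(x - mean) ** 2 for x in mean_data]
variance = sum(squared_diffs) / len(squared_diffs)

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, median, variance)  # 3.0 4 1.5
```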
Example (observational study): A survey, or prolonged observation of students in a school environment.
Vs
Experiment
Is different from an observational study in that it does something intentional to the individuals involved.
Example: One takes a sample of students and gives one group caffeine before a test while the control group has none.
Population
The entire group of people, animals, or things from which data could be drawn.
Sample
A portion of a population, taken to survey for a specific quality.
Population example: The population of Sherwood High School (2,000)
Sample example: The students picked out of a hat to observe study habits (20)
Ogive
A cumulative frequency graph
A curve showing the cumulative frequency for a given set of data

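| Age (years) | Frequency |
| --- | --- |
| 10 | 5 |
| 11 | 10 |
| 12 | 27 |
| 13 | 18 |
| 14 | 6 |
| 15 | 16 |
| 16 | 38 |
| 17 | 9 |

| Age (years) | Frequency | Cumulative Frequency |
| --- | --- | --- |
| 10 | 5 | 5 |
| 11 | 10 | 5 + 10 = 15 |
| 12 | 27 | 15 + 27 = 42 |
| 13 | 18 | 42 + 18 = 60 |
| 14 | 6 | 60 + 6 = 66 |
| 15 | 16 | 66 + 16 = 82 |
| 16 | 38 | 82 + 38 = 120 |
| 17 | 9 | 120 + 9 = 129 |
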
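The running totals in the cumulative frequency column can be reproduced with a short sketch. It uses the ages and frequencies from the table; `itertools.accumulate` is one convenient way to do it, not the only one.

```python
from itertools import accumulate

ages = [10, 11, 12, 13, 14, 15, 16, 17]
frequencies = [5, 10, 27, 18, 6, 16, 38, 9]

# Each cumulative frequency is the sum of all frequencies up to that age.
cumulative = list(accumulate(frequencies))

for age, freq, cum in zip(ages, frequencies, cumulative):
    print(age, freq, cum)
```

Plotting `cumulative` against `ages` and joining the points gives the ogive.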
Histogram
Breaks the range of values of a variable into classes and displays only the count or percent of observations that fall into each class.
First create a frequency table, in this case, of the heights of black cherry trees.
60-64 feet - III
65-69 feet - III
70-74 feet - IIIIIIII
75-79 feet - IIIIIIIIII
80-84 feet - IIIII
85-89 feet - II
Stem and Leaf Plot
A method for showing the frequency with which certain classes of values occur.
To organize the numbers, the leading digit(s) go on the left of the bracket (the stem) and the following digits are the leaves.
Box and Whisker Plot
A way of displaying numerical data and spread using quartiles and median.
Quartiles
Values that divide numerical data into quarters.
1. Put the numbers in order
2. Cut the list into four equal parts

Inter-Quartile Range
The lower quartile subtracted from the upper quartile.
The range of the "box" in a box and whisker plot
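A minimal sketch of quartiles and the inter-quartile range. It assumes one common classroom convention: sort the data, find the median, then take the median of each half, leaving the middle value out of both halves when the count is odd. Other conventions exist and can give slightly different quartiles. The data values and the function name `quartiles` are made up for illustration.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    Needs at least two values so that both halves are non-empty.
    """
    data = sorted(values)              # 1. put the numbers in order
    n = len(data)
    half = n // 2
    lower = data[:half]                # values below the middle
    upper = data[half + 1:] if n % 2 else data[half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))


# Hypothetical data set
q1, q2, q3 = quartiles([7, 2, 9, 4, 5, 8, 4, 6])
iqr = q3 - q1  # upper quartile minus lower quartile
print(q1, q2, q3, iqr)  # 4.0 5.5 7.5 3.5
```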
Bar Graph
A graphical display of data using bars of different heights.
Density Curve
A graphical picture of the population distribution of a variable
Properties
Always on or above the x-axis
Has a total area of exactly 1 under the curve
The area under the curve over any interval gives the proportion of observations falling in that interval.
It is an idealized description, so it does not describe individual observations or outliers.
Median of a Density Curve
The "equal areas" point that divides the area under the density curve in half.
Mean of a Density Curve
The "balance point" at which the curve would balance if made of solid material.
Normal Distribution
Distributions described by normal curves (a kind of density curve that is symmetric, single-peaked, and bell-shaped)
Example: Heights of people, errors in measurement, standardized test scores, blood pressure.
Empirical Rule
Normal curves follow the Empirical Rule
68% of values are within one standard deviation of the mean
95% of values are within two standard deviations of the mean
99.7% of values are within three standard deviations of the mean
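These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) to compute the area under a normal curve within one, two, and three standard deviations of the mean. The exact areas are close to, but not exactly, the rounded 68, 95, and 99.7 figures.

```python
from statistics import NormalDist

# Any normal curve gives the same shares; mean 0 and sd 1 keep it simple.
curve = NormalDist(mu=0, sigma=1)

shares = {}
for k in (1, 2, 3):
    # Area under the curve between k standard deviations below and above the mean.
    shares[k] = curve.cdf(k) - curve.cdf(-k)
    print(f"within {k} standard deviation(s): {shares[k]:.4f}")
```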
Standard Normal Distribution
The normal distribution with mean 0 and standard deviation 1. It is what results when every value in normally distributed data is standardized and the z-scores are plotted.
Z Score/Standardized Score
The number of standard deviations an observation lies from the mean tells us how close to or far from the mean it is.
This is called the z-score or standardized score.
To find the z-score, take an observation, subtract the mean, and divide by the standard deviation: z = (observation - mean) / standard deviation.
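As a quick sketch of the z-score calculation, with a made-up test score, mean, and standard deviation (none of these numbers come from the slides):

```python
def z_score(observation, mean, std_dev):
    """Number of standard deviations the observation lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (observation - mean) / std_dev


# Hypothetical: a test score of 85 when the mean is 70 and the sd is 10.
z = z_score(85, 70, 10)
print(z)  # 1.5, i.e. one and a half standard deviations above the mean
```

A negative z-score would mean the observation is below the mean.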
Table A
Normal Probability Plot
A graphical technique for assessing whether or not a data set is approximately normally distributed
The plot compares the data with what would be expected of data that is perfectly normally distributed
If the plot is roughly linear, the data are approximately normal
Normally Distributed
Skewed Left
Direction/Association
Positive: Above average y-values occur with above average x-values, and below average values occur together.
Negative: Above average y-values occur with below average x-values and vice versa.
No Direction/Association: The points are scattered, with no overall upward or downward trend.
Form
Linear Appearance-straight
Non-Linear- Curved
Groups or Clusters
Strength
How close the points on a scatter plot are to a line of best fit.
Outliers fall outside the overall pattern and weaken the relationship.
Coefficient of Correlation
A quantitative measure of the strength and direction of a linear association. (r)
r is always between -1 and 1; the sign indicates a positive or negative correlation. The further r is from 0, the stronger the correlation.
Coefficient of Determination
Gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable.
A measure of how well the regression line represents the data
r^2
If r = 0.922, then r^2 = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y
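To make r and r^2 concrete, here is a sketch that computes both from scratch. It assumes the usual formula for r: the sum of products of deviations, divided by the square root of the product of the two sums of squared deviations. The data points and the function name `correlation` are made up for illustration.

```python
import math


def correlation(xs, ys):
    """Coefficient of correlation r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_squared = r ** 2   # coefficient of determination
print(round(r, 3), round(r_squared, 3))  # 0.775 0.6
```

Here about 60% of the variation in y is explained by the linear relationship with x.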
Linear Regression
Sometimes called the line of best fit, attempts to model the relationship between two variables by fitting a linear equation to observed data.
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0).
Least Squares Regression Line
Calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (the residuals).
Residuals
The vertical distance between a data point and the graph of a regression equation.
The residual is positive if the data point is above the graph.
The residual is negative if the data point is below the graph.
The residual is 0 only when the graph passes through the data point.
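The least squares line and its residuals can be sketched as follows. The formulas b = Sxy / Sxx and a = mean(y) - b * mean(x) are the standard least squares results and are not stated in the slides. The data and the function name `least_squares_line` are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


# Hypothetical paired data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)

# Residual = observed y minus predicted y (positive when the point is above the line).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(round(a, 2), round(b, 2))           # 2.2 0.6
print([round(e, 2) for e in residuals])   # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

The residuals of a least squares line always add up to zero, apart from rounding error.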
Residual Plot
A plot of the residuals against their corresponding x-values.
If the plot shows no obvious pattern, then a linear model (the LSRL) is appropriate for the data.
Outliers
A data point that is visibly distant from the overall pattern, in either the negative or positive x or y direction.
Influential Points
An outlier in the x direction (having an extreme x value) is often an influential point.
Extrapolation
Predicted value of y: ŷ = a + bx
The use of a regression line for prediction outside the range of values of the explanatory variable x used to obtain the line.
Predicting in this way is rarely accurate.
Example: Suppose y, the weight of a rat in grams, increases by 40 for every one-week increase in x, its age in weeks. Once 30 weeks have passed, the predicted weight of the rat is unreasonable.
Logarithmic Transformation of Exponential Data
Exponential data can be made linear by taking the logarithm of the y-values:
1. (Xi, Yi) -> (Xi, log Yi)
2. (Xi, Yi) -> (Xi, ln Yi)

If 2^x = y, then x = log base 2 of y.
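A short sketch of why the transformation works: if y grows by a constant factor each step, then ln y grows by a constant amount each step, which is what a straight line does. The exponential data below (y = 3 * 2^x) is made up for illustration.

```python
import math

# Hypothetical exponential data: y doubles every time x goes up by 1.
xs = [0, 1, 2, 3, 4]
ys = [3 * 2 ** x for x in xs]   # 3, 6, 12, 24, 48

# Transform (x, y) -> (x, ln y)
log_ys = [math.log(y) for y in ys]

# On a straight line, equal steps in x give equal steps in the transformed y.
differences = [log_ys[i + 1] - log_ys[i] for i in range(len(log_ys) - 1)]
print([round(d, 4) for d in differences])  # every step equals ln 2, about 0.6931
```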
Lurking Variable
A variable that is not among the explanatory or response variables in the study, yet may influence the interpretation of relationships among the variables.
Example: A study attempts to correlate test scores with shoe size. It is not shoe size that affects the results, but rather the fact that older students have bigger shoes and will score higher than younger students.
Conditional Distribution
Simpson's paradox: An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group.
In 1973, the University of California-Berkeley was sued for sex discrimination. The numbers looked pretty incriminating: the graduate schools had just accepted 44% of male applicants but only 35% of female applicants.
However, when the data are broken down by department, the apparent bias disappears.
In a two way table, the conditional distribution can be found by calculating the percent of each entry in the column using the column total.
Baseball Players

| | Elite | Non Elite | Did not Play |
| --- | --- | --- | --- |
| Arthritis | 10 | 9 | 24 |
| No Arthritis | 61 | 206 | 548 |

Percent with arthritis in each column:
Elite: 14%
Non Elite: 4%
Did not Play: 4%
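The column percentages can be computed with a short sketch. It assumes the counts pair up as arthritis / no arthritis for each group in the order listed: 10 and 61 for Elite, 9 and 206 for Non Elite, 24 and 548 for Did not Play. The Elite and Non Elite pairings reproduce the 14% and 4% figures.

```python
# Counts from the two-way table, grouped by column (playing level).
counts = {
    "Elite": {"Arthritis": 10, "No Arthritis": 61},
    "Non Elite": {"Arthritis": 9, "No Arthritis": 206},
    "Did not Play": {"Arthritis": 24, "No Arthritis": 548},
}

percent_with_arthritis = {}
for group, column in counts.items():
    column_total = sum(column.values())
    # Conditional distribution: each entry as a percent of its column total.
    percent_with_arthritis[group] = 100 * column["Arthritis"] / column_total

for group, pct in percent_with_arthritis.items():
    print(f"{group}: {pct:.1f}%")
```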
Common Response
Z: Type of plant
X: How much it needs to be watered
Y: How much sunlight it needs
Confounding
Confounding Variable: How much you ate before swimming
Explanatory Variable: Swimming skill
Response Variable: Chance of drowning
Factors
A factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter.

A factor is a general type or category of treatments
Levels
Different treatments constitute different levels of a factor.
For example, three different groups of runners are subjected to different training methods. The runners are the experimental units and the training methods are the treatments; the three types of training method constitute three levels of the factor 'type of training'.
Simple Random Sample
Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample.
Example: Table B, pulling names out of a hat
Stratified Random Sample
First dividing the population into subgroups (strata), then taking a sample from each stratum.
Example: You could first divide the population of Sherwood High School into subgroups by grade, then select from those subgroups.
Cluster Sample
A sampling technique where the entire population is divided into groups, or clusters, and a random sample of these clusters is selected. All observations in the selected clusters are included in the sample.
Example: The Department of Agriculture wishes to investigate the use of pesticides by farmers in England. A cluster sample could be taken by identifying the different counties in England as clusters. A sample of these counties (clusters) would then be chosen at random, so all farmers in those counties selected would be included in the sample.
Systematic Random Sample
Voluntary Response Sample
When the sample for a study is gathered from volunteers who choose to participate.
It is biased in that those who choose to participate often represent more extreme opinions.
Example: Mailing a political survey where only those who wish to participate respond.
Convenience Sample
A sample that is selected because it is easiest for the researcher of the study to reach.
Example: Selecting a sample of persons only in California for a study on American views because California residents are closer.
Statistically Significant
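The simple random, stratified, and cluster sampling methods can be sketched in Python. Everything here is hypothetical: a made-up school of 2,000 students, grades assigned in rotation, and "homerooms" of 25 students standing in for clusters. The fixed seed only makes the run repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 2,000 students in grades 9 to 12.
population = [{"id": i, "grade": 9 + i % 4} for i in range(2000)]

# Simple random sample: every student has an equal chance of being chosen.
simple_random = rng.sample(population, 20)

# Stratified random sample: split by grade, then sample 5 from each grade.
strata = {}
for student in population:
    strata.setdefault(student["grade"], []).append(student)
stratified = [s for group in strata.values() for s in rng.sample(group, 5)]

# Cluster sample: split into homerooms of 25, pick 2 homerooms at random,
# and include every student in the chosen homerooms.
homerooms = [population[i:i + 25] for i in range(0, len(population), 25)]
cluster = [s for room in rng.sample(homerooms, 2) for s in room]

print(len(simple_random), len(stratified), len(cluster))  # 20 20 50
```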
Response Bias
When respondents answer in the way they think the survey wants them to rather than according to their own beliefs.
Example: Answering in a way that makes the respondent seem healthier or smarter (e.g., How often do you exercise? Do you know what the Watergate scandal is?).
Non Response Bias
When the answers of those who responded differ from the answers that those who were not surveyed or did not respond would have given.
Example: If only those who happen to exercise daily were surveyed on health habits and not those who do not exercise daily.
Under Coverage
When a specific group is not covered in a sample
Example: If Latinos were not included in a survey on the opinions of immigration laws.
Placebo Effect
When the subject receives a false treatment, but is unaware, believing the treatment to be real. This sometimes causes a response since the body thinks something is meant to be happening to it.
Example: The Sugar Pill
Double Blind
When neither the subject receiving the treatment nor the person giving the treatment is aware of whether the treatment is a placebo or not.
Example: Subjects are receiving shots to improve energy. Half the shots are water, the other half are a new treatment. Neither the patient nor the doctor giving the shot knows if it is the new treatment or just water.
Principles of Experimental Design
1. Understand the aims of the experiment
2. Identify the experimental units and treatments
3. Avoid Experimental Bias
4. Replicate

Block Design
The subjects are put into groups (blocks) of the same size as the number of treatments. The members of each block are then randomly assigned to different treatment groups
Example: In a study of skin disease, subjects are blocked based on the severity of their disease. The subjects within each block are then randomly assigned to treatments.
Matched Pair Design
Pairs subjects that are as similar as possible, then randomly assigns one member of each pair to the treatment while the other receives a placebo or acts as the control.
Sample Space
(All Americans)
Event
Americans who are Tea Party Members
Systematic Random Sample: The random process of obtaining the sample based on an arithmetic sequence (for example, every 10th member after a random start).
Conditional probability example: The probability of a student receiving an "A" given that they study more than half an hour.
Independent Events
An event that is not affected by previous events.
Example: One student doing well on a test and another student doing well are independent events.
Multiplication Rule for Dependent Events
If events A and B are not independent, then the probability of the intersection of A and B (the probability that both events occur) is defined by
P(A and B) = P(A)P(B|A).
Disjoint Events
If two events are disjoint (mutually exclusive), then the probability of them both occurring at the same time is 0.
Disjoint: P(A and B) = 0
The addition rule P(A or B) = P(A) + P(B) is only valid when the events are mutually exclusive.
Complement of an Event
All the outcomes that are not in the event.
If the event is a coin landing on heads, the complement is landing on tails.
The complement of A is written A^c.
Rules of Mean
Rules of Variance
Expected Value
Simulation
The imitation of chance behavior, based on a model that accurately reflects the phenomenon.
Steps of a Simulation
1. State the problem
2. State the assumptions (for example, each outcome is equally likely and each outcome is independent of the others)
3. Assign random digits
4. Simulate many repetitions
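The simulation steps can be sketched for a made-up problem: estimating the probability of getting two heads in two coin flips. The assumptions are that heads and tails are equally likely and that the flips are independent. Random digits are assigned so that an even digit means heads and an odd digit means tails. The function name and seed are illustrative.

```python
import random


def simulate_two_heads(repetitions, seed=0):
    """Estimate P(two heads in two flips) by simulating many repetitions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # Assign random digits: even digit = heads, odd digit = tails.
        first_digit = rng.randrange(10)
        second_digit = rng.randrange(10)
        if first_digit % 2 == 0 and second_digit % 2 == 0:
            hits += 1
    # Proportion of repetitions in which both flips were heads.
    return hits / repetitions


estimate = simulate_two_heads(100_000)
print(estimate)  # should land close to the exact answer, 0.5 * 0.5 = 0.25
```

Because the flips are independent, the exact probability is 0.5 × 0.5 = 0.25, so the estimate can be checked against it. More repetitions generally bring the estimate closer.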