STATISTICS

by selin karatepe on 16 October 2014


Transcript of STATISTICS

STATISTICS
Statistics is a tool to help us process, summarize, analyze, and interpret data for the purpose of making better decisions in an uncertain environment.
DESCRIPTIVE STATISTICS
Statistical methods, measures, or techniques used to summarize groups of numbers.
INFERENTIAL STATISTICS
Statistical methods, measures, or techniques used to make decisions based on groups of numbers by providing answers to specific types of questions about them.
MEASUREMENT
Measurement is the process by which we examine the world and end up with a description (usually a number) of some aspect of the world.
The results of measurement are specific descriptions of the world.
They are the first step in doing statistics, which results in general descriptions of the world.
SUBJECT
The individual thing (object or event) being measured. Ordinarily, the subject has many attributes, some of which are measurable features.
A subject may be a single person, object, or event, or some unified group or institution.
VALUE
The result of the particular act of measurement. Ordinarily, values are numbers, but they can also be names or other types of identifiers.
Each value usually describes one aspect or feature of the subject on the occasion of the measurement.
VARIABLE
A mathematical abstraction that can take on multiple values. In statistics, each variable usually corresponds to some measurable feature of the subject.
Each measurement usually results in one value of that variable.
UNIT
For some types of measurement, the particular standard measure used to define the meaning of the number, one.
For instance, inches, grams, dollars, minutes, etc., are all units of measurement.
When we say something weighs two and a half pounds, we mean that it weighs two and a half times as much as a standard pound measure.
STATISTICAL STUDY
A project using statistics to describe a particular set of circumstances, to answer a collection of related questions, or to make a collection of related decisions.

A statistical report is a document presenting the results of a statistical study.

KEY DEFINITIONS
DATA
Facts, especially numerical facts, collected together for reference or information or analysis.
TYPES OF DATA
DATA
The collection of values resulting from a group of measurements.
Usually, each value is labeled by variable and subject, with a timestamp to identify the occasion.
Categorical Data
Data recorded in non-numerical terms. It is called categorical because each different value (such as car model or job title) places the subject in a different category.
Numerical Data
Data recorded in numerical terms. There are different types of numerical data depending upon what numbers the values can be.
Discrete
Discrete data is counted.
Discrete data can only take certain values.
Continuous
Continuous data is measured.
Continuous data can take any value (within a range).
MEASUREMENT LEVELS
SAMPLING
The process of selecting the individuals from the population that makes up our sample. The details of the sampling procedure are what make for different kinds of sample.
COMPREHENSIVE SAMPLING
This is when the sample consists of the entire population, at least in principle. Most often, this kind of sample is not possible and when it is possible, it is rarely practical.
RANDOM SAMPLING
This is when the sample is selected randomly from the population.
In this context, randomly means that every member of the population has an equal chance of being selected as part of the sample.
In most situations, this is the best kind of sample to use.
CONVENIENCE SAMPLING
Selecting the sample by the easiest and/or least costly method available. Whatever kinds of sampling error happen, happen. Convenience sampling is used very often, especially in small studies.
The most important thing to understand about using a convenience sample is to understand the types of errors most likely to happen, given the particular sampling procedure used and the particular population being sampled.
SYSTEMATIC SAMPLING
This is when the sample is selected by a non-random procedure, such as picking every tenth product unit off of the assembly line for testing or every 50th customer off of a mailing list.
The trick to systematic sampling is that, if the list of items is ordered in a way that is unrelated to the statistical questions of interest, a systematic sample can be just as good as, or even better than, a random sample.
For example, if the customers are listed alphabetically by last name, it may be that every customer of a particular type will have an equal chance of being selected, even if not every customer has a chance of being selected.
The problem is that it is not often easy to determine whether the order really is unrelated to what we want to know.
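The every-tenth-unit procedure described above can be sketched in a few lines of Python. This is only an illustration: the list `units` of 100 numbered product units and the helper name `systematic_sample` are assumptions, not part of the original slides.

```python
def systematic_sample(items, step):
    """Return every `step`-th item, starting with the `step`-th one."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    # Index step - 1 is the step-th item; then jump ahead by `step` each time.
    return items[step - 1::step]


# Hypothetical assembly line output: units numbered 1 to 100.
units = list(range(1, 101))
sample = systematic_sample(units, 10)
print(sample)
```

Note that nothing here is random: whether the result behaves like a random sample depends entirely on how `units` is ordered.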
STRATIFIED SAMPLING
This is a sophisticated technique used when there are possible problems with ordinary random sampling, most often due to small sample size. It uses known facts about the population to systematically select subpopulations, and then random sampling is used within each subpopulation. Stratified sampling requires an expert to plan and execute it.
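A rough Python sketch of the idea follows. It assumes proportional allocation (each subpopulation contributes in proportion to its size), which is only one of several possible designs; the strata, their sizes, and the function name `stratified_sample` are hypothetical.

```python
import random


def stratified_sample(strata, total_size, seed=None):
    """Draw a proportionally allocated random sample from each stratum."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation, rounded to the nearest whole subject.
        size = round(total_size * len(members) / population_size)
        # Random sampling (without replacement) inside the stratum.
        sample[name] = rng.sample(members, size)
    return sample


# Hypothetical customer population split by a known fact (account type).
strata = {
    "retail": [f"R{i}" for i in range(60)],
    "business": [f"B{i}" for i in range(30)],
    "government": [f"G{i}" for i in range(10)],
}
chosen = stratified_sample(strata, total_size=20, seed=1)
print({name: len(members) for name, members in chosen.items()})
```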
POPULATION
SAMPLE
QUOTA SAMPLING
This is a variant on the convenience sample common in surveys. Each person responsible for data collection is assigned a quota and then uses convenience sampling, sometimes with restrictions.
An advantage of quota sampling is that different data collectors may find different collection methods convenient. This can prevent the bias created by using just one convenient sampling method.
The biggest problem with a quota sample is that a lot of folks find the same things convenient. In general, the problems of convenience samples apply to quota samples.
PARAMETER
STATISTIC
Similar past events can be used to predict future events.
The more we know about similar decisions in the past and their results, the better we can predict the outcome of the present decision.
The better we can predict the outcome of the present decision, the better we can choose among the alternative courses of action.
Where Is Statistics Used?
Graphical and numerical procedures to summarize and process data
Collect Data
Interviews
Questionnaires
Experiments / Clinical Trials
Direct Measurements
Observing and Recording
Summarize Data
Measures of Central Tendency
Measures of Variability
Measures of Asymmetry
Present Data
Tables
Graphs
Estimation
Hypothesis Testing
Point Estimation
Interval Estimation
Inference is the process of drawing conclusions or making decisions about a population based on sample results.
STATISTICS
is the science of learning from DATA
OCCASION
The particular occurrence of the particular act of measurement, usually identified by the combination of the subject and the time the measurement is taken.
POPULATION
All of the subjects of interest.
The population can be a group of business transactions, companies, customers, anything we can measure and want to know about. The details of which subjects are and are not part of our population should be carefully specified.
A population is the collection of all items of interest or under investigation.
N represents the population size.
A parameter is a specific characteristic of a population.
Values calculated using population data are called parameters!
SAMPLE
The subjects in the population we actually measure.
There are many ways of picking a sample from a population. Each way has its limitations and difficulties.
It is important to know what kind of sample we are using.
A sample is an observed subset of the population.
n represents the sample size.
A statistic is a specific characteristic of a sample.
Values calculated using sample data are called statistics!
Population
Sample
SIMPLE RANDOM SAMPLING
A procedure in which
each member of the population is chosen strictly by chance,
each member of the population is equally likely to be chosen, and
every possible sample of n objects is equally likely to be chosen.
The resulting sample is called a random sample!
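As a small illustration, Python's standard `random.sample` draws n distinct members so that every subset of size n is equally likely. The population of 50 numbered subjects and the seed value are assumptions made only for this sketch.

```python
import random

# Hypothetical population of N = 50 numbered subjects.
population = list(range(1, 51))
n = 5

rng = random.Random(2014)  # fixed seed only so the run is repeatable
# Draw n distinct members without replacement; each subset of size n
# has the same chance of being the sample.
sample = rng.sample(population, n)
print(sample)
```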
EXERCISES
1.1 State whether each of the following variables is categorical or numerical. If categorical, give the level of measurement. If numerical, is it discrete or continuous?
a. Number of e-mail messages sent daily by a financial planner.
b. Actual cost (in dollars, euros, etc.) of a student's textbooks for a given semester.
c. The actual cost (in dollars, euros, etc.) of your electricity bill last month.
d. Faculty ranks (professor, associate professor, assistant professor, or instructor).
1.2 A new Starbucks store recently opened in Istanbul, Turkey. Upon visiting the store, suppose that customers were given a brief survey. Is the answer to each of the following questions categorical or numerical? If categorical, give the level of measurement. If numerical, is it discrete or continuous?
a. Is this your first visit to this Starbucks store?
b. On a scale from 1 (very dissatisfied) to 5 (very satisfied), rate your level of satisfaction with today's purchase.
c. What was the actual cost (in TL) of your purchase today?
1.3 A random sample of tourists in China was asked a series of questions. Identify the type of data that is likely to be used in the answer of each question.
a. What is your favorite tourist destination in China?
b. How many days do you expect to be in China?
c. Do you have children under the age of 10 travelling with you?
d. Rank the following Chinese attractions in order from 1 (most favorite) to 5 (least favorite):
Great Wall; Forbidden City; Terracotta Warriors; Potala Palace; Mogao Caves.
Textbooks and/or References
“Statistics for Business and Economics” by P. Newbold, W.L. Carlson, B. Thorne, Prentice Hall.
Nominal data are considered the lowest or weakest type of data, since numerical identification is chosen strictly for convenience and does not imply ranking of responses.
The values of nominal variables are words that describe the categories or classes of responses.
The values of the gender variable are male and female; the values of "Do you own an iPhone?" are yes or no. We arbitrarily assign a code or number to each response. However, this number has no meaning other than for categorizing.
For example: 1 = Male, 2 = Female; 1 = Yes, 2 = No.
Ordinal data indicate the rank ordering of items, and, as with nominal data, the values are words that describe the responses.
Some examples of ordinal data and possible codes are:
Product quality rating (1:Poor; 2: Average; 3:Good)
Consumer preference among three different types of soft drink (1: most preferred; 2:Second Choice; 3: Third Choice)
In these examples the responses are ordinal; or put into a rank order, but there is no measurable meaning to the difference between responses. That is, the difference between your first and second choices may not be the same as your second and third choices.
An interval scale indicates rank and distance from an arbitrary zero measured in unit intervals. That is, data are provided relative to an arbitrarily determined benchmark.
Temperature is a classic example of this level of measurement, with arbitrarily determined benchmarks generally based on Fahrenheit or Celsius degrees. Suppose that it is 80 degrees F in Orlando and 20 degrees F in Chicago. We can conclude that the difference in temperature is 60 degrees, but we cannot say that it is four times as warm in Orlando as it is in Chicago.
The year is another example of interval level of measurement, with benchmarks based most commonly on the Gregorian calendar.
Ratio data indicate both rank and distance from a natural zero, and ratios of two measures have meaning.
A person who weighs 100 kg is twice the weight of a person who weighs 50 kg; a person who is 40 years old is twice the age of someone who is 20 years old.
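The difference between interval and ratio scales can be checked numerically. The sketch below reuses the Orlando/Chicago temperatures and the 100 kg / 50 kg weights from the slides; the conversion helper and the rounded constant `KG_TO_LB` are assumptions added for illustration.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9


orlando_f, chicago_f = 80, 20
orlando_c = fahrenheit_to_celsius(orlando_f)
chicago_c = fahrenheit_to_celsius(chicago_f)

# Interval scale: the ratio depends on where the arbitrary zero sits.
ratio_f = orlando_f / chicago_f   # 4.0 on the Fahrenheit scale
ratio_c = orlando_c / chicago_c   # a different (here negative) value in Celsius

# Ratio scale: weight has a natural zero, so the ratio survives a unit change.
KG_TO_LB = 2.20462  # approximate conversion factor
ratio_kg = 100 / 50
ratio_lb = (100 * KG_TO_LB) / (50 * KG_TO_LB)

print(ratio_f, ratio_c, ratio_kg, ratio_lb)
```

Because the temperature ratio changes with the scale, "four times as warm" has no meaning, while "twice the weight" holds in any unit.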
Week 1
Graphical Presentation of Data
Data in raw form are usually not easy to use for decision making
Some type of organization is needed
Table
Graph
The type of graph to use depends on the variable being summarized
Graphs to Describe Categorical Variables
Bar Charts and Pie Charts
If we want to draw attention to the frequency of each category, then we will probably use a bar chart!
If we want to draw attention to the proportion of frequencies in each category, then we will probably use a pie chart!
Height of bar or size of pie slice shows the frequency or percentage for each category
A frequency distribution is a table used to organize data. The left column (called classes or groups) includes all possible responses on a variable being studied. The right column is a list of the frequencies, or number of observations, for each class.
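A frequency distribution of this kind can be tabulated with Python's `collections.Counter`. The ten survey responses below are made up for illustration; the percentage column is an optional extra.

```python
from collections import Counter

# Hypothetical responses to "Which brand do you prefer?"
responses = ["A", "B", "B", "C", "A", "B", "D", "B", "C", "B"]

frequencies = Counter(responses)
total = sum(frequencies.values())

# Left column: classes. Right columns: frequency and percentage.
print(f"{'Class':<8}{'Frequency':>10}{'Percent':>10}")
for category in sorted(frequencies):
    count = frequencies[category]
    print(f"{category:<8}{count:>10}{100 * count / total:>9.1f}%")
```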
Pareto Diagram
The Italian economist Vilfredo Pareto (1848-1923) noted that in most cases a small number of factors are responsible for most of the problems.
A Pareto diagram is a bar chart that displays the frequency of defect causes. The bar at the left indicates the most frequent cause and bars to the right indicate causes with decreasing frequencies.
A Pareto diagram is used to separate the "vital few" from the "trivial many".
Pareto's result is applied to a wide variety of behavior over many systems. It is sometimes referred to as the "80-20 Rule".
A student might think that 80% of the work on a group project was done by only 20% of the team members.
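The ordering behind a Pareto diagram can be sketched as follows: sort the causes from most to least frequent and track the cumulative percentage. The defect causes and counts are invented for this example.

```python
# Hypothetical defect counts by cause (made up for illustration).
defects = {"scratch": 42, "dent": 9, "misalignment": 27, "stain": 6, "other": 4}

total = sum(defects.values())
# Order causes from most to least frequent, as the bars of a Pareto diagram are.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

cumulative = 0
pareto_rows = []
for cause, count in ordered:
    cumulative += count
    pareto_rows.append((cause, count, round(100 * cumulative / total, 1)))

for cause, count, cum_pct in pareto_rows:
    print(f"{cause:<14}{count:>4}{cum_pct:>8}%")
```

In this made-up data the two leftmost causes account for roughly 78% of all defects, which is the "vital few" pattern the diagram is meant to expose.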
Graphs to Describe Numerical Variables
Line Charts
A line chart (time-series plot) is used to show the values of a variable over time
Time is measured on the horizontal axis
The variable of interest is measured on the vertical axis
Why do we use frequency distributions?
A frequency distribution is a way to summarize data
The distribution condenses the raw data into a more useful form...
and allows for a quick visual interpretation of the data
Histogram
A graph of the data in a frequency distribution is called a histogram
The interval endpoints are shown on the horizontal axis; the vertical axis is either frequency, relative frequency, or percentage
Bars of the appropriate heights are used to represent the number of observations within each class
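To make the class counts behind a histogram concrete, the sketch below bins the 28 observations that appear in the exercises of this transcript. The choice of equal-width classes 10<20, 20<30, ..., 60<70 is an assumption; other class layouts are equally valid.

```python
# The 28 observations from the exercise data set in this transcript.
data = [17, 28, 39, 39, 40, 59, 12, 62, 51, 41, 32, 21, 13, 54,
        15, 24, 35, 36, 44, 44, 64, 65, 65, 15, 37, 37, 56, 59]

# Assumed class layout: equal-width intervals 10<20, 20<30, ..., 60<70.
width = 10
lower_bounds = range(10, 70, width)

counts = {low: 0 for low in lower_bounds}
for x in data:
    low = (x // width) * width  # lower endpoint of the class containing x
    counts[low] += 1

# Text histogram: one '#' per observation in the class.
for low, count in counts.items():
    print(f"{low:>3} < {low + width:<3} {'#' * count} ({count})")
```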
Ogive
An Ogive (a cumulative line graph) is best used when you want to display the total at any given time.
The relative slopes from point to point will indicate greater or lesser increases; for example, a steeper slope means a greater increase than a more gradual slope.
An Ogive, however, is not the ideal graphic for showing comparisons between categories because it simply combines the values in each category and thus indicates an accumulation, a growing or lessening total. If you simply want to keep track of a total and your individual values are periodically combined, an ogive is an appropriate display.
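The running totals that an ogive plots can be computed with `itertools.accumulate`. The class endpoints and frequencies below are illustrative values, assumed for this sketch.

```python
from itertools import accumulate

# Illustrative classes 10<20, 20<30, ..., 60<70 and their frequencies.
upper_endpoints = [20, 30, 40, 50, 60, 70]
frequencies = [5, 3, 7, 4, 5, 4]

# Running total of observations below each upper endpoint.
cumulative = list(accumulate(frequencies))
cumulative_pct = [round(100 * c / cumulative[-1], 1) for c in cumulative]

# Each (endpoint, cumulative value) pair is one point on the ogive.
for endpoint, c, pct in zip(upper_endpoints, cumulative, cumulative_pct):
    print(f"below {endpoint}: {c:>2} observations ({pct}%)")
```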
Stem and Leaf Displays
A simple way to see distribution details in a data set
METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)
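The method can be sketched in Python for two-digit data, taking the tens digit as the stem and the units digit as the leaf. The data are the 28 observations from the exercises in this transcript; the use of `divmod` is simply one convenient way to split the digits.

```python
from collections import defaultdict

data = [17, 28, 39, 39, 40, 59, 12, 62, 51, 41, 32, 21, 13, 54,
        15, 24, 35, 36, 44, 44, 64, 65, 65, 15, 37, 37, 56, 59]

stems = defaultdict(list)
for x in sorted(data):
    stem, leaf = divmod(x, 10)  # tens digit is the stem, units digit the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```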

Relationships between Variables
Cross Tables (or contingency tables) list the number of observations for every combination of values for two categorical or ordinal variables
If there are r categories for the first variable (rows) and c categories for the second variable (columns), the table is called an r x c cross table.
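A small r x c cross table can be built by counting each (row, column) combination. The eight observations below are hypothetical and give a 2 x 2 table.

```python
from collections import Counter

# Hypothetical observations: (supplier, part quality) for each inspected part.
observations = (
    [("A", "good")] * 3 + [("A", "defective")] * 1
    + [("B", "good")] * 2 + [("B", "defective")] * 2
)

counts = Counter(observations)
rows = sorted({supplier for supplier, _ in observations})  # r categories
cols = sorted({quality for _, quality in observations})    # c categories

# table[i][j] is the number of observations in row i and column j.
table = [[counts[(r, c)] for c in cols] for r in rows]

print(f"{'':<4}" + "".join(f"{c:>11}" for c in cols))
for r, row in zip(rows, table):
    print(f"{r:<4}" + "".join(f"{value:>11}" for value in row))
```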
Data Presentation Errors
Goals for effective data presentation:
Present data to display essential information
Communicate complex ideas clearly and accurately
Avoid distortion that might convey the wrong message
Common presentation errors:
Unequal histogram interval widths
Compressing or distorting the vertical axis
Providing no zero point on the vertical axis
Failing to provide a relative basis in comparing data between groups
Scatter plot: one variable is measured on the vertical axis and the other variable is measured on the horizontal axis.
We can prepare a scatter plot by plotting one point for each pair of values that represents an observation in the data set.
The scatter plot provides a picture of the data, including:
The range of each variable
The pattern of variables over the range
A suggestion as to a possible relationship between two variables
An indication of outliers.
EXERCISES
Week 2
A university administrator requested a breakdown of travel expenses for faculty to attend various professional meetings. It was found that 31% of the travel expenses were spent for transportation costs, 25% for lodging, 17% for food, 20% for conference registration fees, and the remainder was spent for miscellaneous costs.
a. Construct a pie chart.
b. Construct a bar chart.
A company has determined that there are seven possible defects for one of its product lines. Construct a Pareto diagram for the following defect frequencies:
Defect Code Frequency
A 10
B 70
C 15
D 90
E 8
F 4
G 3

Construct a time series plot for the following number of customers shopping at a new mall during a given week.
DAY Number of Customers
Monday 525
Tuesday 540
Wednesday 469
Thursday 500
Friday 586
Saturday 640
Consider the following data:
17 28 39 39 40 59 12 62 51 41 32 21 13 54 15 24 35 36 44 44 64 65 65 15 37 37 56 59

a. Construct a frequency distribution.
b. Draw a histogram.
c. Draw an ogive.
d. Draw a stem-and-leaf display.
Three subcontractors, A, B, and C, supplied 58, 70, and 72 parts, respectively, to a plant during last week. Of the parts supplied by subcontractor A, only four were defective. From the parts supplied by subcontractor B, 60 were good parts; from those supplied by subcontractor C, only six were defective.

a. Set up a cross table for the data.
b. Draw a bar chart.
Beijing Books offers discounted books priced only at $3, $5, and $10. The owner wants to know whether the price has any relationship with the number of days it takes for a customer to decide on a purchase. The following data show the price (X) and the number of days the book was on sale before it was sold (Y). The data are shown as (X, Y) pairs:
(3,7) (5,5) (10,2) (3,9) (5,6) (10,5) (3,6) (5,6)
(10,1) (3,10) (5,7) (10,4) (3,5) (5,6) (10,4)
A random sample of customers was asked to select their favorite soft drink from a list of five brands.
The results showed that 30 preferred brand A, 50 preferred brand B, 46 preferred brand C, 100 preferred brand D, and 14 preferred brand E.

a. Construct a bar chart.
b. Construct a pie chart.
What is the relationship between the price of paint and the demand for this paint? A random sample of (price, quantity) data for 7 days of operation was obtained. Prepare a plot and describe the relationship between quantity and price, with emphasis on any unusual observations.

(10, 100) (8, 120) (5, 200) (4, 200) (10, 90) (7, 110) (6, 150)
A supervisor of a plant kept records of the time (in seconds) that employees needed to complete a particular task. The data are summarized as follows:

Time 30<40 40<50 50<60 60<80 80<100 100<150
Number 10 15 20 30 24 20

a. Graph the data with a histogram.

b. Discuss possible errors.
Describing Data Numerically
Measures of Central Tendency
Measures of Variation
Shape of a Distribution
Measures of the Linear Relationship between two Variables
Measures of Central Tendency
We may construct a histogram to see if the data tend to center or cluster around some value.
Measures of central tendency provide numerical information about a "typical" observation in the data.
A parameter refers to a specific population characteristic; a statistic refers to a specific sample characteristic. Measures of central tendency can be computed for both.
Arithmetic Mean
The mean is the sum of the data values divided by the number of observations.
Median
The median is the middle observation of a set of observations that are arranged in increasing (or decreasing) order.
Mode
The mode, if one exists, is the most frequently occurring value.
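The three measures defined above are available in Python's standard `statistics` module. The seven observations are hypothetical, chosen so the results are easy to verify by hand.

```python
import statistics

# Hypothetical sample of seven observations.
observations = [3, 5, 5, 6, 8, 9, 13]

mean = statistics.mean(observations)      # sum (49) divided by n (7)
median = statistics.median(observations)  # middle value of the sorted data
mode = statistics.mode(observations)      # most frequently occurring value

print(mean, median, mode)
```

Here the mean (7) exceeds the median (6) because the single large value 13 pulls the mean upward.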
Geometric Mean
If you are interested in growth over a number of time periods use the geometric mean.
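As a sketch of how this works, the geometric mean of the growth factors (1 + rate) gives the constant per-period rate that would produce the same overall growth. The three growth rates below are invented for illustration.

```python
import math

# Hypothetical growth rates over three periods: +10%, +20%, -10%.
growth_rates = [0.10, 0.20, -0.10]
factors = [1 + r for r in growth_rates]

# n-th root of the product of the n growth factors.
geometric_mean = math.prod(factors) ** (1 / len(factors))
average_growth = geometric_mean - 1

arithmetic_average = sum(growth_rates) / len(growth_rates)
print(f"geometric: {average_growth:.4f}, arithmetic: {arithmetic_average:.4f}")
```

Compounding `average_growth` over three periods reproduces the total growth exactly, which the arithmetic average of the rates does not.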
Quartiles
Measures of Variation
The mean alone does not provide a complete or sufficient description of data. While two data sets could have the same mean, the individual observations in one set could vary more from the mean than do the observations in the second data set.
Range
Interquartile Range
The IQR measures the spread in the middle 50% of the data; it is the difference between the third quartile and the first quartile.
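A minimal sketch of the IQR calculation, using hypothetical data. Quartile conventions differ between textbooks and software; `statistics.quantiles` with its default "exclusive" method places the quartiles at positions 0.25(n + 1), 0.50(n + 1), and 0.75(n + 1) in the sorted data, which is an assumption about the convention intended here.

```python
import statistics

# Hypothetical sample of seven observations (n = 7).
observations = [12, 15, 17, 21, 24, 28, 35]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(observations, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```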
Variance
We need a measure that would average the total distance between each of the data values and the mean.
But for all data sets, this sum will always equal zero, since the mean is the center of the data.
If each of these differences is squared, then each observation contributes to the sum of the squared terms.
The average of the sum of the squared terms is called the variance.

Computing the variance requires squaring the distances, which changes the unit of measurement to square units!
Standard Deviation
Standard deviation is the positive square root of the variance.
Measures the average spread (variation) around the mean.
Has the same unit of measurement as the original data.
Most commonly used measure of variation.
To calculate the variance and the standard deviation,
Each value in the data set is used (sensitive to outliers)
Values far from the mean are given extra weight (because deviations from the mean are squared)
Coefficient of Variation
The coefficient of variation expresses the standard deviation as a percentage of the mean.
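The measures of variation above can be tied together in one short sketch. The eight observations are hypothetical; the code also checks that the deviations from the mean sum to zero, which is why they are squared before averaging.

```python
import statistics

# Hypothetical data set of eight observations.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(observations)
deviations = [x - mean for x in observations]
deviation_sum = sum(deviations)  # always zero, so it cannot measure spread

population_variance = statistics.pvariance(observations)  # divide by N
sample_variance = statistics.variance(observations)       # divide by n - 1
sample_sd = statistics.stdev(observations)                 # same units as the data

# Coefficient of variation: standard deviation as a percentage of the mean.
cv = 100 * sample_sd / mean

print(deviation_sum, population_variance, sample_variance, sample_sd, cv)
```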
Shape of a Distribution
The shape of a distribution reveals whether data are evenly spread from its center. The shape of a distribution is said to be symmetric if the observations are balanced about its center.
If the distribution is symmetric then the mean is equal to the median and the distribution will have zero skewness.
If, in addition, the distribution is unimodal, then the mean = median = mode.
Skewness
A distribution is skewed, or asymmetric, if the observations are not symmetrically distributed on either side of the center.
Kurtosis
Linear Relationship between two Variables
Covariance and correlation are the numerical measures to describe a linear relationship between variables.
Covariance
Covariance (Cov) is a measure of the linear relationship.
A positive value indicates a direct or increasing relationship.
A negative value indicates a decreasing relationship.
The value of the covariance varies if a variable such as height is measured in feet or inches.
Does not provide a measure of the strength of the relationship.
Correlation Coefficient
A standardized measure of linear relationship (Unit free).
Provides both the direction and the strength of a linear relationship.
Computed by dividing the covariance by the product of standard deviations of the two variables.
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship
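The contrast between covariance and correlation can be sketched as below: changing the unit of one variable from feet to inches rescales the covariance but leaves the correlation unchanged. The five (length, weight) pairs and both function names are hypothetical, and the sample (n - 1) form of the covariance is assumed.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(sample_covariance(x, x))
    sd_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)


# Hypothetical paired observations.
length_ft = [1, 2, 3, 4, 5]
weight = [2, 4, 5, 4, 5]
length_in = [12 * v for v in length_ft]  # same lengths, measured in inches

cov_ft = sample_covariance(length_ft, weight)
cov_in = sample_covariance(length_in, weight)
r_ft = correlation(length_ft, weight)
r_in = correlation(length_in, weight)

print(cov_ft, cov_in, round(r_ft, 4), round(r_in, 4))
```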
Chebychev's Theorem
For any population with mean μ and standard deviation σ, and k > 1, the percentage of observations that fall within the interval
[μ ± kσ]
is at least 100[1 − (1/k²)]%.
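Chebyshev's lower bound, 100[1 − (1/k²)]%, is easy to tabulate for a few values of k. The function name below is hypothetical; the bound itself is the standard statement of the theorem.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2f}%")
```

The bound holds for any distribution, which is why it is much more conservative than rules that assume a bell-shaped population.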

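The Empirical Rule
For many large, mounded (bell-shaped) populations, approximately 68% of the observations lie within μ ± 1σ, approximately 95% within μ ± 2σ, and almost all (about 99.7%) within μ ± 3σ.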
Weighted Mean
Some situations require a special type of mean called the weighted mean, e.g., calculating GPA, average stock recommendation, or approximating the mean in grouped data.
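For the GPA case, a weighted mean multiplies each grade by its credit hours and divides by the total credit hours. The grades, credit hours, and function name below are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical semester: grade points earned and credit hours per course.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 2]

gpa = weighted_mean(grade_points, credit_hours)
print(round(gpa, 2))
```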
Approximations for Grouped Data: Mean
Approximations for Grouped Data: Variance
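The slides give no formulas here, so the sketch below assumes the usual midpoint approximation: each class is represented by its midpoint, weighted by the class frequency, and the sample variance divides by n − 1. The three classes and their frequencies are hypothetical.

```python
# Hypothetical grouped data: (lower endpoint, upper endpoint, frequency).
classes = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]

midpoints = [(low + high) / 2 for low, high, _ in classes]
frequencies = [f for _, _, f in classes]
n = sum(frequencies)

# Approximate mean: frequency-weighted mean of the class midpoints.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Approximate sample variance: weighted squared deviations over n - 1.
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)

print(grouped_mean, round(grouped_variance, 2))
```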
Population Variance
The population variance is the sum of the squared differences between each observation and the population mean divided by the population size, N.
Sample Variance
The sample variance is the sum of the squared differences between each observation and the sample mean divided by the sample size, n, minus 1.
Population Standard Deviation
Sample Standard Deviation
Kurtosis measures both the "peakedness" of the distribution and the heaviness of its tail.
In comparison with the normal distribution,
distributions with negative excess kurtosis are called platykurtic distributions.
distributions with positive excess kurtosis are called leptokurtic distributions.
A leptokurtic distribution has a more acute peak around the mean and fatter tails.
A platykurtic distribution has a lower, wider peak around the mean and thinner tails.
negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed, left-tailed, or skewed to the left.
positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed, right-tailed, or skewed to the right.
Covariance: Direction
Correlation: Strength & Direction
EXERCISES

Week 3
1. The time (in seconds) that a random sample of employees took to complete a task is as follows:
23 35 14 37 28 45 12 40 27 13 26 25 37 20 29 49

a. Find the mean, median and mode.
b. Find the standard deviation.
c. Find the coefficient of variation.
d. Find the IQR.
2. A random sample of data has a mean of 75 and a variance of 25.

a. Use Chebychev's theorem to determine the percent of observations between 65 and 85.

b. If the data are mounded, use the empirical rule to find the approximate percent of observations between 65 and 85.
3. The annual percentage returns on common stocks over a 7-year period were as follows:
4.0% 14.3% 19.0% -14.7% -26.5% 37.2% 23.8%

Over the same period the annual percentage returns on U.S. Treasury bills were as follows:
6.5% 4.4% 3.8% 6.9% 8.0% 5.8% 5.1%

a. Compare the means of these two population distributions
b. Compare the standard deviations of these two population distributions.
4. Coffee shop customers were randomly surveyed and asked to select a category that described the cost of their recent purchase. The results were:

Cost (in USD) 0<2 2<4 4<6 6<8 8<10
# of Customers 2 3 6 5 4

Find the sample mean and the standard deviation of these costs.
5. A consumer goods company has been studying the effect of advertising on total profits. As part of this study, data on advertising expenditures (in 1000 USD) and total sales (in 1000 USD) were collected for a five month period and are as follows:

(10,100) (15,200) (7,80) (12, 120) (14, 150)

The first number is advertising expenditures and the second is total sales. Plot the data and compute the correlation coefficient. Briefly discuss the relationship.