**STATISTICS**

Statistics is a tool to help us process, summarize, analyze, and interpret data for the purpose of making better decisions in an uncertain environment.

DESCRIPTIVE STATISTICS

Statistical methods, measures, or techniques used to summarize groups of numbers.

INFERENTIAL STATISTICS

Statistical methods, measures, or techniques used to make decisions based on groups of numbers by providing answers to specific types of questions about them.

MEASUREMENT

Measurement is the process by which we examine the world and end up with a description (usually a number) of some aspect of the world.

The results of measurement are specific descriptions of the world.

They are the first step in doing statistics, which results in general descriptions of the world.

SUBJECT

The individual thing (object or event) being measured. Ordinarily, the subject has many attributes, some of which are measurable features.

A subject may be a single person, object, or event, or some unified group or institution.

VALUE

The result of the particular act of measurement. Ordinarily, values are numbers, but they can also be names or other types of identifiers.

Each value usually describes one aspect or feature of the subject on the occasion of the measurement.

VARIABLE

A mathematical abstraction that can take on multiple values. In statistics, each variable usually corresponds to some measurable feature of the subject.

Each measurement usually results in one value of that variable.

UNIT

For some types of measurement, the particular standard measure used to define the meaning of the number one.

For instance, inches, grams, dollars, minutes, etc., are all units of measurement.

When we say something weighs two and a half pounds, we mean that it weighs two and a half times as much as a standard pound measure.

STATISTICAL STUDY

A project using statistics to describe a particular set of circumstances, to answer a collection of related questions, or to make a collection of related decisions.

A statistical report is a document presenting the results of a statistical study.

**KEY DEFINITIONS**

DATA

Facts, especially numerical facts, collected together for reference or information or analysis.

TYPES OF DATA

DATA

The collection of values resulting from a group of measurements.

Usually, each value is labeled by variable and subject, with a timestamp to identify the occasion.

Categorical Data

Data recorded in non-numerical terms. It is called categorical because each different value (such as car model or job title) places the subject in a different category.

Numerical Data

Data recorded in numerical terms. There are different types of numerical data depending upon what numbers the values can be.

Discrete

Discrete data is counted.

Discrete Data can only take certain values.

Continuous

Continuous data is measured

Continuous Data can take any value (within a range)

MEASUREMENT LEVELS

**SAMPLING**

The process of selecting the individuals from the population that make up our sample. The details of the sampling procedure are what make for different kinds of samples.

COMPREHENSIVE SAMPLING

This is when the sample consists of the entire population, at least in principle. Most often, this kind of sample is not possible and when it is possible, it is rarely practical.

RANDOM SAMPLING

This is when the sample is selected randomly from the population.

In this context, randomly means that every member of the population has an equal chance of being selected as part of the sample.

In most situations, this is the best kind of sample to use.

CONVENIENCE SAMPLING

Selecting the sample by the easiest and/or least costly method available. Whatever kinds of sampling error happen, happen. Convenience sampling is used very often, especially in small studies.

The most important thing about using a convenience sample is to understand the types of errors most likely to happen, given the particular sampling procedure used and the particular population being sampled.

SYSTEMATIC SAMPLING

This is when the sample is selected by a non-random procedure, such as picking every tenth product unit off of the assembly line for testing or every 50th customer off of a mailing list.

The trick to systematic sampling is that, if the list of items is ordered in a way that is unrelated to the statistical questions of interest, a systematic sample can be just as good as, or even better than, a random sample.

For example, if the customers are listed alphabetically by last name, it may be that every customer of a particular type will have an equal chance of being selected, even if not every customer has a chance of being selected.

The problem is that it is not often easy to determine whether the order really is unrelated to what we want to know.
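
The "every 50th customer" procedure can be sketched in a few lines of Python. This is only an illustration: the mailing list is hypothetical, and the optional random starting offset is a common variant that the text above does not require.

```python
import random

def systematic_sample(items, k, start=None):
    """Return every k-th item of a list, beginning at index `start`."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if start is None:
        # Assumption: pick a random starting point among the first k items.
        start = random.randrange(k)
    return items[start::k]

# Hypothetical mailing list of 500 customer IDs (1..500).
mailing_list = list(range(1, 501))

# start=49 selects the 50th, 100th, 150th, ... customer on the list.
sample = systematic_sample(mailing_list, k=50, start=49)
print(sample)
```

If the list order were related to the question of interest (for example, customers sorted by purchase size), the same code would produce a biased sample, which is the risk described above.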

STRATIFIED SAMPLING

This is a sophisticated technique used when there are possible problems with ordinary random sampling, most often due to small sample size. It uses known facts about the population to systematically select subpopulations, and then random sampling is used within each subpopulation. Stratified sampling requires an expert to plan and execute it.
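
A minimal sketch of the idea, assuming the subpopulations (strata) are already known and an equal number of subjects is drawn from each. The customer data, the function name, and the equal-allocation rule are all hypothetical; real stratified designs often allocate the sample in proportion to stratum size.

```python
import random

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Group subjects by stratum, then draw a random sample within each group."""
    rng = random.Random(seed)
    strata = {}
    for subject in population:
        strata.setdefault(stratum_of(subject), []).append(subject)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        # Random sampling without replacement inside the stratum.
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

# Hypothetical population: 90 retail customers and 10 wholesale customers.
customers = [("retail", i) for i in range(90)] + [("wholesale", i) for i in range(10)]
picked = stratified_sample(customers, stratum_of=lambda c: c[0], n_per_stratum=5, seed=1)
print(picked)
```

With ordinary random sampling of 10 customers, the small wholesale group could easily be missed entirely; here it is guaranteed to contribute 5 subjects.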

**POPULATION**

**SAMPLE**

QUOTA SAMPLING

This is a variant on the convenience sample common in surveys. Each person responsible for data collection is assigned a quota and then uses convenience sampling, sometimes with restrictions.

An advantage of quota sampling is that different data collectors may find different collection methods convenient. This can prevent the bias created by using just one convenient sampling method.

The biggest problem with a quota sample is that a lot of folks find the same things convenient. In general, the problems of convenience samples apply to quota samples.

**PARAMETER**

STATISTIC

Similar past events can be used to predict future events.

The more we know about similar decisions in the past and their results, the better we can predict the outcome of the present decision.

The better we can predict the outcome of the present decision, the better we can choose among the alternative courses of action.

Where Is Statistics Used?

Graphical and numerical procedures to summarize and process data

Collect Data

Interviews

Questionnaires

Experiments / Clinical Trials

Direct Measurements

Observing and Recording

Summarize Data

Measures of Central Tendency

Measures of Variability

Measures of Asymmetry

Present Data

Tables

Graphs

Estimation

Hypothesis Testing

Point Estimation

Interval Estimation

Inference is the process of drawing conclusions or making decisions about a population based on sample results.

STATISTICS is the science of learning from DATA.

OCCASION

The particular occurrence of the particular act of measurement, usually identified by the combination of the subject and the time the measurement is taken.

All of the subjects of interest.

The population can be a group of business transactions, companies, customers, anything we can measure and want to know about. The details of which subjects are and are not part of our population should be carefully specified.

A population is the collection of all items of interest or under investigation.

N represents the population size.

A parameter is a specific characteristic of a population.

Values calculated using population data are called parameters!

The subjects in the population we actually measure.

There are many ways of picking a sample from a population. Each way has its limitations and difficulties.

It is important to know what kind of sample we are using.

A sample is an observed subset of the population.

n represents the sample size.

A statistic is a specific characteristic of a sample.

Values calculated using sample data are called statistics!

Population

Sample

SIMPLE RANDOM SAMPLING is a procedure in which each member of the population is chosen strictly by chance, each member of the population is equally likely to be chosen, and every possible sample of n objects is equally likely to be chosen.

The resulting sample is called a random sample!
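
As a rough illustration of these definitions, the sketch below draws a simple random sample with Python's `random.sample`, which selects without replacement, and compares a parameter (the population mean) with the corresponding statistic (the sample mean). The population values are made up, and the names `mu` and `x_bar` are just labels chosen for this example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: 1000 transaction amounts.
population = [rng.randint(10, 200) for _ in range(1000)]
N = len(population)          # population size

n = 50                       # sample size
sample = rng.sample(population, n)  # simple random sample, without replacement

mu = statistics.mean(population)    # parameter: computed from population data
x_bar = statistics.mean(sample)     # statistic: computed from sample data

print(f"N = {N}, population mean = {mu:.2f}")
print(f"n = {n}, sample mean = {x_bar:.2f}")
```

Re-running with a different seed changes the statistic but not the parameter, which is why inference has to account for sampling variation.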

**EXERCISES**

1.1 State whether each of the following variables is categorical or numerical. If categorical, give the level of measurement. If numerical, is it discrete or continuous?

a. Number of e-mail messages sent daily by a financial planner.

b. Actual cost (in dollars, euros, etc.) of a student's textbooks for a given semester.

c. The actual cost (in dollars, euros, etc.) of your electricity bill last month.

d. Faculty ranks (professor, associate professor, assistant professor, or instructor).

1.2 A new Starbucks store recently opened in Istanbul, Turkey. Upon visiting the store, suppose that customers were given a brief survey. Is the answer to each of the following questions categorical or numerical? If categorical, give the level of measurement. If numerical, is it discrete or continuous?

a. Is this your first visit to this Starbucks store?

b. On a scale from 1 (very dissatisfied) to 5 (very satisfied), rate your level of satisfaction with today's purchase.

c. What was the actual cost (in TL) of your purchase today?

1.3 A random sample of tourists in China was asked a series of questions. Identify the type of data that is likely to be used in the answer of each question.

a. What is your favorite tourist destination in China?

b. How many days do you expect to be in China?

c. Do you have children under the age of 10 travelling with you?

d. Rank the following Chinese attractions in order from 1 (most favorite) to 5 (least favorite):

Great Wall; Forbidden City; Terracotta Warriors; Potala Palace; Mogao Caves.

Textbooks and/or References

“Statistics for Business and Economics” by P. Newbold, W.L. Carlson, B. Thorne, Prentice Hall.

Nominal data are considered the lowest or weakest type of data, since numerical identification is chosen strictly for convenience and does not imply ranking of responses.

The values of nominal variables are words that describe the categories or classes of responses.

The values of the gender variable are male and female; the values of "Do you own an iPhone?" are yes or no. We arbitrarily assign a code or number to each response. However, this number has no meaning other than for categorizing.

For example: 1 = Male, 2 = Female; 1 = Yes, 2 = No.

Ordinal data indicate the rank ordering of items, and, as with nominal data, the values are words that describe the responses.

Some examples of ordinal data and possible codes are:

Product quality rating (1:Poor; 2: Average; 3:Good)

Consumer preference among three different types of soft drink (1: most preferred; 2:Second Choice; 3: Third Choice)

In these examples the responses are ordinal; or put into a rank order, but there is no measurable meaning to the difference between responses. That is, the difference between your first and second choices may not be the same as your second and third choices.

An interval scale indicates rank and distance from an arbitrary zero measured in unit intervals. That is, data are provided relative to an arbitrarily determined benchmark.

Temperature is a classic example of this level of measurement, with arbitrarily determined benchmarks generally based on Fahrenheit or Celsius degrees. Suppose that it is 80 degrees F in Orlando, and 20 degrees F in Chicago. We can conclude that the difference in temperature is 60 degrees, but we cannot say that it is four times as warm in Orlando as it is in Chicago.

The year is another example of interval level of measurement, with benchmarks based most commonly on the Gregorian Calendar.

Ratio data indicate both rank and distance from a natural zero, so that ratios of two measures have meaning.

A person who weighs 100 kg is twice the weight of a person who weighs 50 kg; a person who is 40 years old is twice the age of someone who is 20 years old.
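
One way to see the difference between interval and ratio scales is to change units and check whether a ratio survives. The sketch below reuses the Orlando and Chicago temperatures and the 100 kg and 50 kg weights from the text; the conversion factor 2.20462 pounds per kilogram is an approximation.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

orlando_f, chicago_f = 80, 20
orlando_c = fahrenheit_to_celsius(orlando_f)
chicago_c = fahrenheit_to_celsius(chicago_f)

# Interval scale: the ratio depends on the arbitrary zero point.
ratio_f = orlando_f / chicago_f   # 4 in Fahrenheit
ratio_c = orlando_c / chicago_c   # about -4 in Celsius

# Ratio scale: the ratio is the same in any unit because zero is natural.
KG_TO_LB = 2.20462  # approximate conversion factor
ratio_kg = 100 / 50
ratio_lb = (100 * KG_TO_LB) / (50 * KG_TO_LB)

print(ratio_f, ratio_c)
print(ratio_kg, ratio_lb)
```

The temperature "ratio" even changes sign between the two scales, while the weight ratio stays at 2 regardless of the unit.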

**Week 1**

**Graphical Presentation of Data**

Data in raw form are usually not easy to use for decision making

Some type of organization is needed

Table

Graph

The type of graph to use depends on the variable being summarized

Graphs to Describe Categorical Variables

Bar Charts and Pie Charts

If we want to draw attention to the frequency of each category, then we will probably use a bar chart!

If we want to draw attention to the proportion of frequencies in each category, then we will probably use a pie chart!

Height of bar or size of pie slice shows the frequency or percentage for each category

A frequency distribution is a table used to organize data. The left column (called classes or groups) includes all possible responses on a variable being studied. The right column is a list of the frequencies, or number of observations, for each class.

Pareto Diagram

The Italian economist Vilfredo Pareto (1848-1923) noted that in most cases a small number of factors are responsible for most of the problems.

A Pareto diagram is a bar chart that displays the frequency of defect causes. The bar at the left indicates the most frequent cause and bars to the right indicate causes with decreasing frequencies.

A Pareto diagram is used to separate the "vital few" from the "trivial many".

Pareto's result is applied to a wide variety of behavior over many systems. It is sometimes referred to as the "80-20 Rule".

A student might think that 80% of the work on a group project was done by only 20% of the team members.
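
The table behind a Pareto diagram can be sketched as follows: count each cause, sort from most to least frequent, and accumulate the percentages. The defect log here is invented for illustration, and the cumulative-percentage column is a common addition rather than something the definition above requires.

```python
from collections import Counter

# Hypothetical defect log: one entry per observed defect.
defects = (["scratch"] * 42 + ["dent"] * 27 + ["misalignment"] * 18
           + ["discoloration"] * 8 + ["other"] * 5)

counts = Counter(defects)
total = sum(counts.values())

pareto_rows = []
cumulative = 0
# most_common() orders causes from highest to lowest frequency,
# which is the left-to-right order of the bars.
for cause, freq in counts.most_common():
    cumulative += freq
    pareto_rows.append((cause, freq, 100 * cumulative / total))

for cause, freq, cum_pct in pareto_rows:
    print(f"{cause:15s} {freq:3d} {cum_pct:6.1f}%")
```

In this made-up log the first two causes account for 69% of all defects, which is the kind of "vital few" pattern the diagram is meant to expose.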

Graphs to Describe Numerical Variables

Line Charts

A line chart (time-series plot) is used to show the values of a variable over time

Time is measured on the horizontal axis

The variable of interest is measured on the vertical axis

Why do we use frequency distributions?

A frequency distribution is a way to summarize data

The distribution condenses the raw data into a more useful form...

and allows for a quick visual interpretation of the data

Histogram

A graph of the data in a frequency distribution is called a histogram

The interval endpoints are shown on the horizontal axis; the vertical axis is either frequency, relative frequency, or percentage.

Bars of the appropriate heights are used to represent the number of observations within each class

Ogive

An Ogive (a cumulative line graph) is best used when you want to display the total at any given time.

The relative slopes from point to point will indicate greater or lesser increases; for example, a steeper slope means a greater increase than a more gradual slope.

An Ogive, however, is not the ideal graphic for showing comparisons between categories because it simply combines the values in each category and thus indicates an accumulation, a growing or lessening total. If you simply want to keep track of a total and your individual values are periodically combined, an ogive is an appropriate display.
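
The numbers plotted in a histogram and an ogive can be computed as below. This is a sketch under stated assumptions: the data are hypothetical, all classes have equal width, and each class includes its lower endpoint but not its upper one.

```python
from itertools import accumulate

def class_frequencies(data, lower, width, n_classes):
    """Count observations in classes [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * n_classes
    for x in data:
        i = int((x - lower) // width)
        if 0 <= i < n_classes:
            counts[i] += 1
    return counts

# Hypothetical task times in seconds.
times = [12, 15, 21, 24, 28, 31, 33, 35, 38, 42, 44, 47, 51, 58]

freq = class_frequencies(times, lower=10, width=10, n_classes=5)  # histogram bar heights
rel_freq = [f / len(times) for f in freq]                         # relative frequencies
cum_freq = list(accumulate(freq))                                 # ogive heights

for i, (f, c) in enumerate(zip(freq, cum_freq)):
    lo = 10 + 10 * i
    print(f"{lo}<{lo + 10}: frequency {f}, cumulative {c}")
```

The last cumulative frequency always equals the number of observations, which is a quick check that no value fell outside the classes.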

Stem and Leaf Displays

A simple way to see distribution details in a data set

METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)
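
The method above can be sketched in Python, assuming non-negative whole numbers where the last digit is the leaf and the remaining leading digits form the stem. The scores are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Map each stem (leading digits) to its sorted list of leaves (last digit)."""
    stems = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        stems[stem].append(leaf)
    return dict(stems)

# Hypothetical data set.
scores = [32, 21, 24, 41, 26, 27, 30, 24, 38]
display = stem_and_leaf(scores)

for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

Each printed row shows one stem followed by its leaves, so the length of a row plays the same role as the height of a histogram bar while keeping the individual values visible.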

Relationships between Variables

Cross Tables (or contingency tables) list the number of observations for every combination of values for two categorical or ordinal variables

If there are r categories for the first variable (rows) and c categories for the second variable (columns), the table is called an r x c cross table.
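
A cross table can be built by counting each (row category, column category) pair. The sketch below uses invented survey observations and produces a 2 x 2 table as nested lists; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical observations: (region, preferred product) for each respondent.
observations = [
    ("North", "A"), ("North", "B"), ("South", "A"), ("South", "A"),
    ("North", "A"), ("South", "B"), ("South", "B"), ("South", "A"),
]

counts = Counter(observations)
rows = sorted({r for r, _ in observations})   # r categories of the first variable
cols = sorted({c for _, c in observations})   # c categories of the second variable

# table[i][j] = number of observations in row category i and column category j
table = [[counts[(r, c)] for c in cols] for r in rows]

print("      ", *cols)
for r, row in zip(rows, table):
    print(f"{r:6s}", *row)
```

The cell counts add up to the total number of observations, and row or column totals recover the frequency distribution of each variable on its own.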

**Data Presentation Errors**

Goals for effective data presentation:

Present data to display essential information

Communicate complex ideas clearly and accurately

Avoid distortion that might convey the wrong message

Common errors:

Unequal histogram interval widths

Compressing or distorting the vertical axis

Providing no zero point on the vertical axis

Failing to provide a relative basis in comparing data between groups

Scatter Plots

We can prepare a scatter plot by locating one point for each pair of values of the two variables that represents an observation in the data set.

One variable is measured on the vertical axis and the other variable is measured on the horizontal axis.

The scatter plot provides a picture of the data, including:

The range of each variable

The pattern of variables over the range

A suggestion as to a possible relationship between two variables

An indication of outliers.

**EXERCISES**


**Week 2**

A university administrator requested a breakdown of travel expenses for faculty to attend various professional meetings. It was found that 31% of the travel expenses were spent for transportation costs, 25% for lodging, 17% for food, 20% for conference registration fees, and the remainder was spent for miscellaneous costs.

a. Construct a pie chart.

b. Construct a bar chart.

A company has determined that there are seven possible defects for one of its product lines. Construct a Pareto diagram for the following defect frequencies:

| Defect Code | Frequency |
| --- | --- |
| A | 10 |
| B | 70 |
| C | 15 |
| D | 90 |
| E | 8 |
| F | 4 |
| G | 3 |

Construct a time series plot for the following number of customers shopping at a new mall during a given week.

| Day | Number of Customers |
| --- | --- |
| Monday | 525 |
| Tuesday | 540 |
| Wednesday | 469 |
| Thursday | 500 |
| Friday | 586 |
| Saturday | 640 |

Consider the following data:

17 28 39 39 40 59 12 62 51 41 32 21 13 54 15 24 35 36 44 44 64 65 65 15 37 37 56 59

a. Construct a frequency distribution.

b. Draw a histogram.

c. Draw an ogive.

d. Draw a stem-and-leaf display.

Three subcontractors, A, B, and C, supplied 58, 70, and 72 parts, respectively, to a plant during last week. Of the parts supplied by subcontractor A, only four were defective. From the parts supplied by subcontractor B, 60 were good parts; from those supplied by subcontractor C, only six were defective.

a. Set up a cross table for the data.

b. Draw a bar chart.

Beijing Books offers discounted books priced only at $3, $5, and $10. The owner wants to know whether the price has any relationship with the number of days it takes for a customer to decide on a purchase. The following data show the price (X) and the number of days the book was on sale before it was sold (Y). The data are shown as (X,Y) pairs:

(3,7) (5,5) (10,2) (3,9) (5,6) (10,5) (3,6) (5,6)

(10,1) (3,10) (5,7) (10,4) (3,5) (5,6) (10,4)

A random sample of customers was asked to select their favorite soft drink from a list of five brands.

The results showed that 30 preferred brand A, 50 preferred brand B, 46 preferred brand C, 100 preferred brand D, and 14 preferred brand E.

a. Construct a bar chart.

b. Construct a pie chart.

What is the relationship between the price of paint and the demand for this paint? A random sample of (price, quantity) data for 7 days of operation was obtained. Prepare a plot and describe the relationship between quantity and price, with emphasis on any unusual observations.

(10, 100) (8, 120) (5, 200) (4, 200) (10, 90) (7, 110) (6, 150)

A supervisor of a plant kept records of the time (in seconds) that employees needed to complete a particular task. The data are summarized as follows:

| Time | 30<40 | 40<50 | 50<60 | 60<80 | 80<100 | 100<150 |
| --- | --- | --- | --- | --- | --- | --- |
| Number | 10 | 15 | 20 | 30 | 24 | 20 |

a. Graph the data with a histogram.

b. Discuss possible errors.

**Describing Data Numerically**

Measures of Central Tendency

Measures of Variation

Shape of a Distribution

Measures of the Linear Relationship between two Variables

Measures of Central Tendency

We may construct a histogram to see if the data tend to center or cluster around some value.

Measures of central tendency provide numerical information about a "typical" observation in the data.

A parameter refers to a specific population characteristic; a statistic refers to a specific sample characteristic. Measures of central tendency can be computed for both.

Arithmetic Mean

The mean is the sum of the data values divided by the number of observations.

Median

The median is the middle observation of a set of observations that are arranged in increasing (or decreasing) order.

Mode

The mode, if one exists, is the most frequently occurring value.

Geometric Mean

If you are interested in growth over a number of time periods use the geometric mean.
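
These measures are available in Python's standard `statistics` module. The sketch below uses small made-up data sets; the growth factors (1 plus the growth rate in each period) are hypothetical, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

# Hypothetical sample of six observations.
data = [3, 5, 5, 6, 8, 9]

mean = statistics.mean(data)      # sum of values / number of observations
median = statistics.median(data)  # average of the two middle values when n is even
mode = statistics.mode(data)      # most frequently occurring value

# Hypothetical growth factors for three periods: +10%, +20%, -10%.
growth_factors = [1.10, 1.20, 0.90]
g = statistics.geometric_mean(growth_factors)
average_growth_rate = g - 1       # average growth per period

print(mean, median, mode)
print(f"average growth per period: {100 * average_growth_rate:.2f}%")
```

Compounding the geometric mean over the three periods reproduces the total growth, which the arithmetic mean of the growth rates would not do.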

Quartiles

Measures of Variation

The mean alone does not provide a complete or sufficient description of data. While two data sets could have the same mean, the individual observations in one set could vary more from the mean than do the observations in the second data set.

Range

Interquartile Range

The IQR measures the spread in the middle 50% of the data; it is the difference between the third quartile and the first quartile.
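
A short sketch of the range and the IQR using `statistics.quantiles`. The data are hypothetical. Note the assumption about convention: the default "exclusive" method locates the quartiles at positions 0.25(n+1), 0.50(n+1), and 0.75(n+1) in the ordered data, and other software or textbooks may use slightly different rules.

```python
import statistics

# Hypothetical sample of seven observations.
data = [2, 4, 5, 7, 8, 10, 12]

data_range = max(data) - min(data)

# n=4 splits the ordered data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50% of the data

print(f"range = {data_range}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Unlike the range, the IQR ignores the smallest and largest quarter of the observations, so a single extreme value has little effect on it.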

Variance

We need a measure that would average the total distance between each of the data values and the mean.

But for every data set, the sum of the differences between the data values and the mean is always equal to zero, since the mean is the center of the data.

If each of these differences is squared, then each observation contributes to the sum of the squared terms.

The average of the sum of the squared terms is called the variance.

Computing the variance requires squaring the distances, which changes the unit of measurement to square units!

Standard Deviation

Standard deviation is the positive square root of the variance.

Measures the average spread (variation) around the mean.

Has the same unit of measurement as the original data.

Most commonly used measure of variation.

To calculate the variance and the standard deviation,

Each value in the data set is used (sensitive to outliers)

Values far from the mean are given extra weight (because deviations from the mean are squared)

Coefficient of Variation

The coefficient of variation expresses the standard deviation as a percentage of the mean.
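
The sketch below compares two hypothetical data sets that share the same mean but differ in spread, using the sample standard deviation from the `statistics` module. The helper name `coefficient_of_variation` is chosen for this example only.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100 * statistics.stdev(data) / statistics.mean(data)

# Two hypothetical data sets with the same mean (50) but different variation.
set_a = [48, 49, 50, 51, 52]
set_b = [30, 40, 50, 60, 70]

cv_a = coefficient_of_variation(set_a)
cv_b = coefficient_of_variation(set_b)

print(statistics.mean(set_a), statistics.mean(set_b))
print(f"CV of A = {cv_a:.2f}%, CV of B = {cv_b:.2f}%")
```

Because the coefficient of variation is unit free, it can also be used to compare variation between data sets measured in different units or with very different means.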

Shape of a Distribution

The shape of a distribution reveals whether data are evenly spread from its center. The shape of a distribution is said to be symmetric if the observations are balanced about its center.

If the distribution is symmetric then the mean is equal to the median and the distribution will have zero skewness.

If, in addition, the distribution is unimodal, then the mean = median = mode.

Skewness

A distribution is skewed, or asymmetric, if the observations are not symmetrically distributed on either side of the center.

Kurtosis

Linear Relationship between two Variables

Covariance and correlation are the numerical measures to describe a linear relationship between variables.

Covariance

Covariance (Cov) is a measure of the linear relationship.

A positive value indicates a direct or increasing relationship.

A negative value indicates a decreasing relationship.

The value of the covariance varies if a variable such as height is measured in feet or inches.

Does not provide a measure of the strength of the relationship.

Correlation Coefficient

A standardized measure of linear relationship (Unit free).

Provides both the direction and the strength of a linear relationship.

Computed by dividing the covariance by the product of standard deviations of the two variables.

Ranges between –1 and 1

The closer to –1, the stronger the negative linear relationship

The closer to 1, the stronger the positive linear relationship

The closer to 0, the weaker any linear relationship
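
The following sketch computes the sample covariance and the correlation coefficient by hand, and shows that changing height from feet to inches changes the covariance but not the correlation. The height and weight values are invented, and the functions use the sample (n - 1) divisor.

```python
import math

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical data: heights in feet and weights in pounds.
height_ft = [5.0, 5.5, 6.0, 6.5]
weight_lb = [120, 150, 165, 200]
height_in = [12 * h for h in height_ft]  # same heights, measured in inches

cov_ft = sample_covariance(height_ft, weight_lb)
cov_in = sample_covariance(height_in, weight_lb)
r_ft = correlation(height_ft, weight_lb)
r_in = correlation(height_in, weight_lb)

print(f"covariance: {cov_ft:.2f} (feet) vs {cov_in:.2f} (inches)")
print(f"correlation: {r_ft:.4f} (feet) vs {r_in:.4f} (inches)")
```

The covariance is 12 times larger in inches, while the correlation is unchanged, which is why only the correlation can be read as a measure of strength.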

Chebychev's Theorem

For any population with mean μ and standard deviation σ, and k > 1, the percentage of observations that fall within the interval

$[\mu - k\sigma,\ \mu + k\sigma]$

is at least

$100\left[1 - \frac{1}{k^2}\right]\%$
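
In its standard form, Chebychev's bound says that at least 100[1 - 1/k²]% of the observations lie within k standard deviations of the mean. A small sketch of that calculation, with a hypothetical function name and example values of k:

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 100 * (1 - 1 / k ** 2)

for k in (1.5, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1f}% of observations")
```

The bound holds for any distribution, so it is conservative: for mound-shaped data the actual percentage is usually much higher.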

The Empirical Rule

Weighted Mean

Some situations require a special type of mean called a weighted mean, e.g., calculating GPA, average stock recommendation, or approximating the mean of grouped data.
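
For the GPA case, a weighted mean multiplies each value by its weight, sums the products, and divides by the total weight. The grade points and credit hours below are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical transcript: grade points earned and credit hours per course.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 3]

gpa = weighted_mean(grade_points, credit_hours)
print(f"GPA = {gpa:.2f}")
```

When all weights are equal, the weighted mean reduces to the ordinary arithmetic mean.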

Approximations for Grouped Data: Mean

Approximations for Grouped Data: Variance
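
A common way to approximate these quantities is to treat every observation in a class as if it sat at the class midpoint and to weight each midpoint by the class frequency. The sketch below follows that approach with the sample (n - 1) divisor for the variance; the classes and frequencies are hypothetical, and the function name is chosen for illustration.

```python
import math

def grouped_mean_and_variance(classes, frequencies):
    """Approximate the sample mean and variance from (lower, upper) classes."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    n = sum(frequencies)
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    variance = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)) / (n - 1)
    return mean, variance

# Hypothetical grouped data: class limits and number of observations per class.
classes = [(0, 10), (10, 20), (20, 30)]
frequencies = [2, 5, 3]

mean, variance = grouped_mean_and_variance(classes, frequencies)
print(f"approximate mean = {mean:.2f}")
print(f"approximate variance = {variance:.2f}, standard deviation = {math.sqrt(variance):.2f}")
```

The results are approximations because the individual values inside each class are unknown.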

Population Variance

The population variance is the sum of the squared differences between each observation and the population mean divided by the population size, N.

Sample Variance

The sample variance is the sum of the squared differences between each observation and the sample mean divided by the sample size, n, minus 1.

Population Standard Deviation

Sample Standard Deviation
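
The only difference between the population and sample versions is the divisor (N versus n - 1); each standard deviation is the positive square root of the corresponding variance. The `statistics` module provides both pairs of functions, shown here on a small made-up data set.

```python
import statistics

# Hypothetical data: treated first as a whole population, then as a sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)   # squared deviations summed, divided by N
samp_var = statistics.variance(data)   # squared deviations summed, divided by n - 1

pop_sd = statistics.pstdev(data)       # square root of the population variance
samp_sd = statistics.stdev(data)       # square root of the sample variance

print(f"population: variance = {pop_var}, standard deviation = {pop_sd}")
print(f"sample:     variance = {samp_var:.4f}, standard deviation = {samp_sd:.4f}")
```

The sample variance is always the larger of the two for the same data, and the gap shrinks as the number of observations grows.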

Kurtosis measures both the "peakedness" of the distribution and the heaviness of its tails.

In comparison with the normal distribution, distributions with negative excess kurtosis are called platykurtic distributions, and distributions with positive excess kurtosis are called leptokurtic distributions.

A leptokurtic distribution has a more acute peak around the mean and fatter tails.

A platykurtic distribution has a lower, wider peak around the mean and thinner tails.

negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed, left-tailed, or skewed to the left.

positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed, right-tailed, or skewed to the right.

Covariance: Direction

Correlation: Strength & Direction


**EXERCISES**

**Week 3**

1. The time (in seconds) that a random sample of employees took to complete a task is as follows:

23 35 14 37 28 45 12 40 27 13 26 25 37 20 29 49

a. Find the mean, median and mode.

b. Find the standard deviation.

c. Find the coefficient of variation.

d. Find the IQR.

2. A random sample of data has a mean of 75 and a variance of 25.

a. Use Chebychev's theorem to determine the percent of observations between 65 and 85.

b. If the data are mound-shaped, use the empirical rule to find the approximate percent of observations between 65 and 85.

3. The annual percentage returns on common stocks over a 7-year period were as follows:

4.0% 14.3% 19.0% -14.7% -26.5% 37.2% 23.8%

Over the same period the annual percentage returns on U.S. Treasury bills were as follows:

6.5% 4.4% 3.8% 6.9% 8.0% 5.8% 5.1%

a. Compare the means of these two population distributions

b. Compare the standard deviations of these two population distributions.

4. Coffee shop customers were randomly surveyed and asked to select a category that described the cost of their recent purchase. The results were:

| Cost (in USD) | 0<2 | 2<4 | 4<6 | 6<8 | 8<10 |
| --- | --- | --- | --- | --- | --- |
| # of Customers | 2 | 3 | 6 | 5 | 4 |

Find the sample mean and the standard deviation of these costs.

5. A consumer goods company has been studying the effect of advertising on total profits. As part of this study, data on advertising expenditures (in 1000 USD) and total sales (in 1000 USD) were collected for a five month period and are as follows:

(10,100) (15,200) (7,80) (12, 120) (14, 150)

The first number is advertising expenditures and the second is total sales. Plot the data and compute the correlation coefficient. Briefly discuss the relationship.