**Data Management**

**Unit 4**

Probability Distributions


**Unit 3**

Statistical Analysis


**Unit 2 Organization of Data**

**MDM 4U**

Mr. Notten


**E. Culminating Investigation**

**Unit 1**

Counting and Probability


1.0 Intro to Probability

The Monty Hall Problem

1.1 The Language of Probability

1.2

Visualizing Probability

A. Probability

B. Types of Probability

C. Probability Definitions

D. The Complement Property

The Law of Large Numbers

1. Empirical (Experimental)

2. Theoretical

3. Subjective

1. Sample space (continuous and discrete)

2. Trial

3. Outcome

Applications: insurance, the lottery, casinos, the stock market

**Review Concepts**

**1. Algebra and solving equations**

2. Creating graphs (histograms, pie charts, bar graphs etc...)

3. Prime/composite numbers

4. Perfect squares


http://www.iflscience.com/health-and-medicine/herd-immunity-and-measles-why-we-should-aim-100-vaccination-coverage

Herd Immunity and the Measles

Herd Immunity: if a certain percentage of people in a population is immunized, the chance of an outbreak is much lower

Rubik's Cubes

The Birthday Problem

As the number of trials gets very large, an experimental probability approaches the theoretical probability
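A minimal simulation sketch of this idea, assuming a fair coin (theoretical probability of heads 0.5). The function name `experimental_probability` and the fixed seed are illustrative choices, not part of the course material.

```python
import random

def experimental_probability(trials, seed=0):
    """Flip a simulated fair coin `trials` times and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable (an assumption)
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

# The experimental probability should drift toward 0.5 as trials increase.
for n in (10, 100, 10_000, 1_000_000):
    print(n, experimental_probability(n))
```

With only 10 flips the result can be far from 0.5; with a million flips it should be very close.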

1.3

Conditional Probability

A. Independent events

B. Independent events vs. Mutually Exclusive events. What's the difference?

B. Mutually Exclusive (Disjoint), non-mutually exclusive events

A. Outcome tables, probability trees, Venn diagrams

1.4

Independence

1.5

Learning How to Count

Factorials

Permutations

**QUIZ**

The Gambler's Fallacy

Monty Hall Problem Explained Using Conditional Probability

**UNIT TEST**

Explanation:

The gambler's fallacy is the mistaken belief that, if something happens more frequently than normal during some period, it will happen less frequently in the future, or that, if something happens less frequently than normal during some period, it will happen more frequently in the future (presumably as a means of balancing nature).

Slot machines and casinos in general take advantage of this type of thinking.

http://en.wikipedia.org/wiki/Gambler%27s_fallacy

http://graphics.wsj.com/infectious-diseases-and-vaccines/#b02g20t20w15

Vaccination Stats:

1.7

Pascal's Triangle

Unit Project

Game Makers

The original (3×3×3) Rubik's Cube has eight corners and twelve edges.

There are 8! (40,320) ways to arrange the corner cubes. Seven can be oriented independently, and the orientation of the eighth depends on the preceding seven, giving $3^7$ (2,187) possibilities.

There are 12!/2 (239,500,800) ways to arrange the edges, since an even permutation of the corners implies an even permutation of the edges as well. (When arrangements of centres are also permitted, as described below, the rule is that the combined arrangement of corners, edges, and centres must be an even permutation.)

Eleven edges can be flipped independently, with the flip of the twelfth depending on the preceding ones, giving $2^{11}$ (2,048) possibilities.

http://en.wikipedia.org/wiki/Rubik%27s_Cube
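The four counts above can be combined with the multiplication principle to get the total number of positions. This is a short sketch; the variable names are illustrative.

```python
from math import factorial

corner_arrangements = factorial(8)        # 8! = 40,320
corner_orientations = 3 ** 7              # 2,187
edge_arrangements = factorial(12) // 2    # 12!/2 = 239,500,800
edge_orientations = 2 ** 11               # 2,048

# Multiply the independent choices together (fundamental counting principle).
total_positions = (corner_arrangements * corner_orientations
                   * edge_arrangements * edge_orientations)
print(f"{total_positions:,}")
```

The result is about 4.3 × 10^19 positions.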

Permutations

4.0 That's So Random (?)

Terms:

probability, types (empirical, theoretical, subjective), complement, law of large numbers, mutually exclusive, non-mutually exclusive, conditional probability, independent events

Calculate probabilities using:

Venn diagrams, tree diagrams, outcome tables, the complement property, conditional probability, multiplicative rule for independent events

1.1-1.4 Quiz

Mutually Exclusive

Non-Mutually Exclusive

http://www.cbc.ca/news/lotteries-what-are-the-odds-1.775281

Winning the Lottery

Lottomax

Lotto 649

1.6

Does order matter?

Combinations

Odds

How many ways can you arrange a group of objects?

How many different groups can you select from a set of objects?
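The two questions can be compared side by side with Python's `math` module (Python 3.8 or later). The Lotto 6/49 line assumes a draw of 6 numbers from 49 where order does not matter and the bonus number is ignored.

```python
from math import comb, factorial, perm

# Order matters: arrangements (permutations)
print(factorial(5))   # ways to arrange 5 distinct objects in a row
print(perm(5, 3))     # ordered selections of 3 objects from 5

# Order does not matter: groups (combinations)
print(comb(5, 3))     # unordered selections of 3 objects from 5
print(comb(49, 6))    # assumed Lotto 6/49 draw: choose 6 numbers from 49
```

Each group of 3 objects can be arranged in 3! = 6 ways, which is why `perm(5, 3)` is six times `comb(5, 3)`.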

You must design a game that is:

Interesting and original

Easy to play

Profitable (for the people running the game) in "Data Dollars"

Probability not obvious for the player!

You can use dice, spinners, coins, random number generators, cards and more (but you are responsible for bringing it in).

We will be having a "Game Fair" the Friday AFTER March Break where we will all get a chance to play each other's games. You will record all the outcomes of each trial for players playing your game.

You will also have to submit a written portion (due the following week) including:

Rules of the game (available to players at the Game Fair)

Theoretical probability of player winning

Player's expected return per game (probability of winning x winnings)

Summary of results from Game Fair

Comparison of actual and theoretical probabilities

6 numbers

3 letters followed by 3 numbers

3 numbers followed by 3 letters

4 letters followed by 3 numbers

3 numbers followed by a letter

Kenya

Ontario

France

2 letters followed by 3 numbers followed by 2 letters

http://garsia.math.yorku.ca/~zabrocki/math5020f03/lot649/lot649v3.pdf

License Plates

Postal Codes

Phone Numbers

How many total possible 7-digit phone numbers are there?

How many total possible postal codes are there?

Example: M9R 1V8

How many possible license plates are there?

[17,576,000]

[10,000,000]

[456,976,000]

Other license plates

How to get there?

Example: 555-2445
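One way to get to the bracketed answers is the multiplication principle: multiply the number of choices for each position. The sketch below assumes every digit (0-9) and every letter (A-Z) is allowed in every position, which real phone numbers, postal codes and plates do not fully allow.

```python
DIGITS = 10    # 0-9
LETTERS = 26   # A-Z

# 7-digit phone number, e.g. 555-2445
phone_numbers = DIGITS ** 7

# 3 letters and 3 digits, e.g. a postal code like M9R 1V8
three_letters_three_digits = LETTERS ** 3 * DIGITS ** 3

# 4 letters followed by 3 numbers
four_letters_three_digits = LETTERS ** 4 * DIGITS ** 3

print(f"{phone_numbers:,}")
print(f"{three_letters_three_digits:,}")
print(f"{four_letters_three_digits:,}")
```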

http://www.informationisbeautiful.net/

https://howsecureismypassword.net/

How Secure is My Password?

There are some simple rules that you must follow when changing your password:

Your password must be exactly 8 characters long.

It must start with a letter.

Remember that UPPERCASE letters are different from lowercase letters (for example, A is treated as different from a).

It must contain at least one character that is not a letter, such as a digit.

http://www.sussex.ac.uk/its/help/faq?faqid=839

How many possible passwords?

When choosing usernames and passwords for users and groups you add to your Google for Work account, consider the following:

Usernames can contain letters (a-z), numbers (0-9), dashes (-), underscores (_), apostrophes ('), and periods (.).

Usernames can't contain an equal sign (=), brackets (<,>), plus sign (+), or more than one period (.) in a row.

Passwords can contain any combination of ASCII characters and must contain a minimum of 8 characters.

First and last names support unicode/UTF-8 characters, with a maximum of 60 characters.

Periods (.) are not ignored as they are in a gmail.com account. If you create a user account called username, this user will not be able to receive messages addressed to user.name, or us.er.na.me, or any other combination of periods. To let a user receive mail with these variations, create an email alias for them.

https://support.google.com/a/answer/33386?hl=en

When creating a password you have the following characters which you can use:

numbers (10 different ones: 0-9)

letters (52 different ones: A-Z and a-z)

special characters (32 different ones).

1. 6-character password with letters

2. 6-character password with letters and numbers

3. 8-character password with letters and numbers

4. 10-character password with letters and numbers

5. 10-character password with letters, numbers and special characters

Do these help?

How many possible passwords?
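A rough sketch of the five cases, assuming characters may repeat and any allowed character may appear in any position. The dictionary name `cases` is illustrative.

```python
DIGITS, LETTERS, SPECIALS = 10, 52, 32   # character counts listed above

# Each position is an independent choice, so the count is (choices) ** (length).
cases = {
    "6 characters, letters": LETTERS ** 6,
    "6 characters, letters and numbers": (LETTERS + DIGITS) ** 6,
    "8 characters, letters and numbers": (LETTERS + DIGITS) ** 8,
    "10 characters, letters and numbers": (LETTERS + DIGITS) ** 10,
    "10 characters, letters, numbers and special": (LETTERS + DIGITS + SPECIALS) ** 10,
}

for description, count in cases.items():
    print(f"{description}: {count:,}")
```

Adding length increases the count much faster than adding a few more allowed characters.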

http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen

http://www.ted.com/talks/hans_and_ola_rosling_how_not_to_be_ignorant_about_the_world#t-145672

**An Introduction to Data**

You will be getting a rubric tomorrow!

Gapminder TED Talk

How not to be ignorant about the world TED Talk

Facebook Connections

Information is Beautiful

http://www.thestar.com/yourtoronto/education/2015/03/05/boys-do-less-homework-than-girls-global-study-finds.html

2.1 Visualizing

Data

http://www.keepeek.com/Digital-Asset-Management/oecd/education/the-abc-of-gender-equality-in-education/tackling-underperformance-among-boys_9789264229945-5-en

Creating Data

Visualizations

1) Excel & Numbers

2) plot.ly

3) Gapminder World

discrete data

vs.

continuous data

qualitative data

vs.

quantitative data

primary data

vs.

secondary data

Girls do more homework than boys?

**2.4 Reporting**

on Data


**2.5 Big Data**

**2.6 Your Data**

Toronto Star Article

Chapter 2 of Original OECD Study

The ABC of Gender Equality in Education: Tackling Underperformance Among Boys

According to this Toronto Star Article, yes they do...

2.1 Case Study

http://www.pewresearch.org/quiz/science-knowledge/

http://www.people-press.org/2013/04/22/publics-knowledge-of-science-and-technology/

What do you know about Science?

Pew Research Center Quiz

Pew Research Results

"Public's Knowledge of Science and Technology"

Below is a quiz written by the Pew Research Center to measure scientific literacy in the USA. Try it yourself and see how you compare!

2.3 Case Study

Pew Research Center Survey Methodology

http://www.pewresearch.org/methodology/u-s-survey-research/

What is Fair?

Fair Games:

player has an equal chance of winning or losing or all players have the same probability of winning

Expected value:

how much a player can expect to win or lose, on average, each time they play a probability game
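A small worked sketch of expected value, using a hypothetical game that is not from the course: the player pays 1 Data Dollar to roll a die and receives 4 Data Dollars on a six, for a net gain of 3 on a win and a net loss of 1 otherwise.

```python
from fractions import Fraction

def expected_value(outcomes):
    """outcomes: list of (probability, net_gain) pairs whose probabilities sum to 1."""
    assert sum(p for p, _ in outcomes) == 1
    return sum(p * gain for p, gain in outcomes)

# Hypothetical game: win a net 3 with probability 1/6, lose 1 with probability 5/6.
game = [(Fraction(1, 6), 3), (Fraction(5, 6), -1)]
ev = expected_value(game)
print(ev)  # -1/3
```

A negative expected value for the player means the game is profitable, on average, for the people running it.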

Case Study: 1972 Ford Pinto

and the business of probability

**2.3 Always go to the Source**

**2.2 What is Fair?**

http://www.engineering.com/Library/ArticlesPage/tabid/85/ArticleID/166/Ford-Pinto.aspx

Utilitarianism: philosophy that states the best choice is the one that benefits the most people (maximizes utility)

Split-Bar Graphs

population

vs.

sample

census

vs.

survey

Some Sources of Data:

Statistics Canada

OECD

Pew Research

Sports leagues (NHL, NBA, MLB, MLS)

EQAO

Beware the Filter Bubble

http://blog.apastyle.org/apastyle/2013/12/how-to-cite-a-data-set-in-apa-style.html

Pew Hispanic Center. (2004). Changing channels and crisscrossing cultures:

A survey of Latinos on the news media [Data file and code book]. Retrieved

from http://pewhispanic.org/datasets/

The in-text citation would be "Pew Hispanic Center (2004)" or "(Pew Hispanic Center, 2004)."

Referencing Data in APA

format of data

1. What surprised you most about the article?

2. Was the headline successful in catching your attention?

3. Has reading this article changed the way you think about sunbathing? Why or why not?

4. Why do you think Mr. Notten chose this article for you to read?

5. Take out the marking scheme for your 2.4 Assignment. How do you think this article would score on the assignment?

6. Name 5 things that may indicate this article may be using unreliable data or may be purposefully misrepresenting data.

7. Put a star (*) everywhere in the article you think there should be an in-text citation.

8. What can we learn from this article about reporting on data?

Discussion questions

More Discussion Questions

"Scientists Blow The Lid on Cancer & Sunscreen Myth"

http://theunboundedspirit.com/scientists-blow-the-lid-on-cancer-sunscreen-myth/

Correlation vs. Causation

Does correlation imply causation?

In your group, choose 2 quantitative variables you think have a correlation. You will need to collect the data on the students in the class and create a scatter plot with the data. Here are some examples of variables...

• Height

• Arm span

• Shoe size

• Foot length

• Hours spent watching TV

• Hours spent looking at a screen

• Number of siblings

• Number of electronic devices owned

• Number of Facebook friends

• Snapchat score

• Number of tweets

• Number of twitter followers

• Number of phone contacts

• Hair length

• Number of courses taken this semester

• Number of math courses taken this year

Or come up with your own!

1) Choose your variables, identify the independent and dependent variable

2) Collect data from the class

3) Plot the data, draw a line of best fit, describe the correlation

4) Is this a causal relationship? [Does the independent variable affect the dependent variable?]
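Step 3 can also be checked numerically. The sketch below computes a least-squares line of best fit and the correlation coefficient r; the height and arm span values are made up for illustration, and the function name is hypothetical.

```python
from math import sqrt

def line_of_best_fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)   # correlation coefficient, between -1 and 1
    return slope, intercept, r

# Made-up class data: height (independent) and arm span (dependent), in cm.
heights = [155, 160, 165, 170, 175, 180]
arm_spans = [153, 161, 163, 172, 174, 183]
slope, intercept, r = line_of_best_fit(heights, arm_spans)
print(f"arm span = {slope:.2f} * height + {intercept:.2f}, r = {r:.3f}")
```

A value of r near 1 or -1 describes a strong linear correlation, but it says nothing about whether the relationship is causal.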

Instructions...

Terms:

causal relationship

common-cause factor

spurious correlation

https://www.google.ca/trends/explore#q=%2Fm%2F06__v%2C%20%2Fm%2F019w40&cmpt=q

Google Trends - Search Frequency

https://books.google.com/ngrams

Google Ngram Viewer - Appearance of Words in Books

The Big Data of Google

http://www.informationisbeautiful.net/visualizations/google-ngram-experiments/

Do cell phones cause cancer?

http://tylervigen.com/

http://www.eqao.com/emagazine/2008/05/eMagArticle.aspx?Lang=E&ArticleID=08&ItemID=23

piktochart.com

prezi

10 awesome sites

powerpoint

Creating infographics

1. Describe the correlation in the scatter plot above (linear, non-linear, strong, weak, positive, negative)

2. Do you think this is a causal relationship?

If so, explain why you think so.

If not, identify a common-cause factor

3. Is the data in this graph easy to understand or is it misleading? What would you change about it?

Respond to the following...

What are the limitations of 'small data' and how can we overcome them?

A) What is 'small data'?

B) This is why you shouldn’t believe that exciting new medical study

http://www.vox.com/2015/3/23/8264355/research-study-hype

http://www.informationisbeautiful.net/play/snake-oil-supplements/

Explore...

C) What is meta-analysis and why is it useful?

crime

education

social media

consumer goods

health

environment

D) What is big data and why is it useful?

Google search, maps, image search

Facebook, youtube, twitter, instagram

Google maps...

Homework: Make a list of anything that is regularly collecting data on you (phone GPS, apps, itunes, websites, retail locations).

Can produce false positive results or over exaggerated results that lack accuracy

Not enough research. Small data = small numbers

Small amount of data entry may be insufficient to obtain a concrete trend

The limitations of small data may not be accurate as it doesn't generalize to a larger population. The population is broad

Classification problems

Only one case study. Not many factors studied

data cannot be always allocated accurately to a given industry and therefore it becomes necessary to assign these data to an additional industry

Influences researchers cannot control

Environmental factors can alter results therefore creating inaccuracy

Your responses...

small number of data points (10, 100, 1000)

data reported in individual studies

analysis of a large quantity of studies identifying trends in the studies themselves

Topics

http://www.cbc.ca/news/canada/toronto/end-streaming-in-ontario-high-schools-report-urges-1.3030195

http://www.cbc.ca/news/canada/toronto/sex-ed-curriculum-changes-protested-by-thousands-at-queen-s-park-1.3032264

Task:

Part 2: Find correlation [individual]

use your data hunting skills to find correlation between data in the chosen issue

data must be from a reliable source

data must be approved by teacher

student will submit a scatter plot along with description of correlation

due Monday

Part 3: Compile data [topic group]

compile each team member's data into a common Google Spreadsheet so all members of the group can access and edit the data at any time

Part 4: Create an infographic [topic group]

the topic group will work together to create an infographic based on the data they have collected in parts 2 and 3

Part 1: The issue [topic group]

Create an infographic on a current issue

decide what topic most interests you and meet with the rest of the topic group

decide on a current issue that you will all search for data on

issue is to be approved by teacher

due Friday

**2.5 Assignment**

Some big data resources

IHME (Institute for Health Metrics and Evaluation)

http://www.healthdata.org/

Gapminder

http://www.gapminder.org/

crime rate in Etobicoke

effect of crime rate on education

new sex ed curriculum

streaming in grade 9 (applied and academic)

EQAO and Literacy Test

effect of social media on school marks

compare different devices/apps

effect of screen time on school marks

obesity rates

alternative energy

electric cars

Fraser Institute

http://ontario.compareschoolrankings.org/SchoolsByRankLocationName.aspx?schoolType=secondary&SortBy=RankThisYear

Issues

Toronto Crime Rates

http://www.cbc.ca/toronto/features/crimemap/

More...

EQAO, Stats Canada, Pew Research etc.

Google Trends

Google nGram Viewer

Choosing a topic

Choosing variables

Finding data

http://www.makeuseof.com/tag/awesome-free-tools-infographics/

Complete and show examples

Google Docs

Sharing data

http://en.wikipedia.org/wiki/Misleading_graph

Sampling Methods

(2.3 in text)

Simple random

[equally likely to choose anybody]

Systematic random

[survey every nth person]

Stratified random

[divide population into groups and randomly sample the groups]

Cluster random

[divide population into groups and do a census of randomly selected groups]

Multi-stage random

[divide population into groups and do a simple random sample of randomly selected groups]

Destructive

[non-repeatable, samples require destruction of subject, not done on people]
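A few of these methods can be sketched with Python's `random` module. The population of 32 students, the group names and the sample sizes are all hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed so the example is repeatable
population = [f"student{i:02d}" for i in range(1, 33)]  # hypothetical list of 32 people

# Simple random: everybody is equally likely to be chosen.
simple = rng.sample(population, 8)

# Systematic random: random starting point, then every nth person.
step = 4
start = rng.randrange(step)
systematic = population[start::step]

# Stratified random: divide into groups, then randomly sample within each group.
strata = {"grade 11": population[:12], "grade 12": population[12:]}
stratified = {name: rng.sample(group, len(group) // 4)
              for name, group in strata.items()}

# Cluster random: divide into groups, then take everyone in a randomly chosen group.
clusters = [population[i:i + 8] for i in range(0, 32, 8)]
cluster_sample = rng.choice(clusters)

print(simple, systematic, stratified, cluster_sample, sep="\n")
```

Here the stratified sample takes the same fraction (one quarter) from each group, so larger groups contribute more people.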

Survey Questions

(2.4 in text)

Types:

Open Questions

Closed Questions

Information questions

Checklist questions

Ranking questions

Rating questions

simple, relevant, specific and readable.

Avoid...

1) jargon

2) abbreviations

3) negatives

4) leading questions

5) insensitivity

Survey questions should be...

Bias in Surveys

(2.5 in text)

Sampling bias

[chosen sample does not represent population]

Non-response bias

[surveys not returned, skewing results]

Household bias

[overrepresentation of one group of respondents]

Response bias

[factors in the sampling method affecting the results]

Types of bias

Avoiding Bias

1) stratified random sampling

Discussion questions...

1. What type of sampling is used in this survey?

2. Is the data collected quantitative or qualitative? Discrete or continuous?

3. Identify the question type for each question. Would you have changed any of the types of questions?

4. Why do I ask for your name at the beginning of the survey? Is this a good idea?

5. Was the survey biased in any way? What types of bias were there and how would you fix it?

6. Why is it common for a survey to ask questions about race, age or gender?

Mr Notten's Survey

https://docs.google.com/forms/d/1Sjh9gRfYsdNg3BY6ytAOHVBX_Emms7f7TiHLtpZOuvc/viewform

Non-random

[not everyone has an equal chance of being selected, or the group is handpicked]

Census

[everyone in a population is sampled]

Why do we usually try to do random sampling?

What is the issue with non-random sampling?

Do you think most surveys done are random or non-random?

Your survey questions will be evaluated on these criteria.

How can we avoid bias when we do a survey?

Creating a survey...

Tools:

Google Forms (part of Google Docs)

[requires free Google login]

Survey Monkey (surveymonkey.com)

[requires you make a free account]

Any other online survey platform

Links to surveys can be shared on Twitter, Facebook etc...

Paper survey

[effective if you want to target a specific group you can physically meet with]

Now it's time for you to collect some of your own primary data!

Big Data and Social Justice

http://www.washingtonpost.com/blogs/wonkblog/wp/2015/04/28/the-most-racist-places-in-america-according-to-google/

http://news.nationalpost.com/toronto/torontos-new-police-chief-wont-abolish-controversial-practice-of-carding-there-will-be-an-increase-in-crime

Carding in Toronto

**3.1 Central Tendency**

A. Mean, Median and Mode

B. Central Tendency

How can we describe a set of data based on its distribution?

LEFT SKEW

RIGHT SKEW

Weighted Mean

Uniform

U-Shaped

Skewed

Mound-Shaped

**3.3 Measures of Spread**

B. Standard Deviation

**3.4 The Normal Distribution**

C. Characteristics of Normal Distributions

B. Where can we find normal distributions?

**3.5 Applying Normal Distributions**

**A. Z-Scores**

B. Percentiles


How can we use normal distributions to understand data?

**Unit 3 Quiz**

4.2 Binomial Distributions

The Digits of Pi

4.4 Hypergeometric Distributions

4.3 Geometric Distributions

**3.2 Creating Histograms**

A. Creating Histograms

B. Interpreting Histograms

C. Tools for creating histograms

How can we create and interpret histograms?

When are mean, median and mode useful?

Steps:

choose a quantitative, continuous variable

collect data

determine range of data (highest - lowest)

choose appropriate number of intervals (so bin width is easy to work with)

calculate a bin width (range/number of intervals)

make sure no values lie on the boundary between two intervals

count frequency for each bin

scale y-axis and plot data

label the mean, median and mode on the histogram
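The counting steps above can be sketched in a few lines. The heights are made-up data and `histogram_counts` is a hypothetical helper. It assumes each bin includes its lower boundary, with the largest value placed in the last bin, and it assumes the data are not all equal.

```python
def histogram_counts(data, num_intervals):
    """Return (bin_width, counts) for equal-width bins from min(data) to max(data)."""
    low, high = min(data), max(data)
    data_range = high - low                 # range = highest - lowest
    bin_width = data_range / num_intervals  # bin width = range / number of intervals
    counts = [0] * num_intervals
    for value in data:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((value - low) / bin_width), num_intervals - 1)
        counts[index] += 1
    return bin_width, counts

heights_cm = [152, 158, 160, 163, 165, 165, 168, 170, 171, 175, 180, 182]
bin_width, counts = histogram_counts(heights_cm, 5)
print(bin_width, counts)
```

Choosing 5 intervals here gives a bin width of 6 cm, which is easy to label on the x-axis.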

https://plot.ly/~cimar/214/_2013-nhl-player-height/

Plotly

Google Sheets

using frequency()

more advanced

**Your Task...**

Collect some data from the class and make a histogram

Why is it important to look at the spread of data?

vs.

Remember, histograms are used for quantitative, continuous variables

height

shoe length

hair length

finger span

head circumference

arm span

money in wallet

oldest coin in pocket/wallet

Variance

Find the IQR by:

1) Order the data and find the median (Q2)

2) Find the median of each half (Q1 and Q3)

3) IQR = Q3-Q1
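A short sketch of these three steps. It assumes that when there is an odd number of values, the median itself is left out of both halves; other conventions exist and can give slightly different quartiles. The function name and data are illustrative.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-each-half method."""
    ordered = sorted(data)                  # step 1: order the data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n, skip the middle value when forming the upper half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)   # step 2

q1, q2, q3 = quartiles([13, 1, 9, 5, 3, 11, 7])
iqr = q3 - q1                               # step 3: IQR = Q3 - Q1
print(q1, q2, q3, iqr)
```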

A. Interquartile Range

Task: calculate the IQR, standard deviation and variance for your graph

Homework: Pg 168 #1-6, do 7-8 with technology!

a measure of spread based around the median

commonly shown in box and whisker plots

Box and Whisker Plot

and Variance

a measure of spread around the mean

the higher the standard deviation, the further away from the mean the data is spread

Calculate the standard deviation of a set of data by hand (easiest done in a table):

1) Calculate the mean of the data

2) Calculate the difference between each value and the mean

3) Calculate the square of the difference for each value

4) Sum the squares

5) Plug the data into the standard deviation formula
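The five steps translate directly into code. The slides do not show which formula is used, so this sketch assumes the population formula (divide by the number of values) by default, with the sample formula (divide by one less) as an option.

```python
from math import sqrt

def standard_deviation(data, sample=False):
    """Standard deviation by the table method; population formula unless sample=True."""
    mean = sum(data) / len(data)                   # 1) mean of the data
    differences = [x - mean for x in data]         # 2) difference from the mean
    squares = [d ** 2 for d in differences]        # 3) square each difference
    total = sum(squares)                           # 4) sum the squares
    divisor = len(data) - 1 if sample else len(data)
    variance = total / divisor                     # variance = SD squared
    return sqrt(variance)                          # 5) apply the formula

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values with mean 5
print(standard_deviation(data))   # 2.0
```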

http://www.mathsisfun.com/data/quincunx.html

the square of the standard deviation

another measure of spread

we will use standard deviation more in this unit

Calculating Standard Deviation Using Technology

**gender**

**sex**

biological

psychological

independent of sexual orientation

(LGBT)

How should we ask about gender in our studies?

vs

A. What is the normal distribution?

modeled by:

found in a lot of seemingly random real world observations

we represent patterns in random events using probability distributions: the normal distribution is one example

used as an approximation in statistical data

this isn't always a good assumption

a probability distribution (connect the tops of bars in a histogram) that forms a bell shape

useful because it is relatively easy to model

notation used in textbook: $X \sim N(\bar{x}, \sigma^2)$

example $X \sim N(5, 2.3^2)$

One person is a flipper, one person is a counter

1) Flip a coin 15 times

2) Record how many times heads came up

3) Repeat steps 1 and 2 three more times

4) Tell the teacher how many times heads came up in each set of 15 flips

In pairs...


symmetrical (mean, median and mode all the same)

bell shaped, approaching 0 at the extremes

68% of data within 1 SD

95% of data within 2 SD

99.7% of data within 3 SD

y-axis usually represents the probability

total area under the curve = 1
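The 68%, 95% and 99.7% figures can be checked with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The three areas come out to roughly 0.68, 0.95 and 0.997, and the area keeps approaching 1 as k grows.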

4.1 Probability Distributions

Random variable = X

variable subject to chance

Discrete Random variable = X

variable that assumes a unique value for each outcome

Probability Distribution of a Discrete Random Variable

Probability Distribution of a Continuous Random Variable

Calculate the probability of each outcome for the sum when rolling a pair of dice.

Is the sum a discrete or continuous variable?

What is the expected value?
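One way to work through these questions is to list all 36 equally likely outcomes for two fair six-sided dice and count how many give each sum. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 outcomes produce each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability distribution P(X = x) for the sum X.
distribution = {total: Fraction(ways, 36) for total, ways in sorted(counts.items())}

# Expected value: add up x * P(X = x) over every possible sum.
expected = sum(total * p for total, p in distribution.items())

for total, p in distribution.items():
    print(total, p)
print("expected value:", expected)
```

The sum takes only the whole-number values 2 through 12, so it is a discrete random variable.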

Calculating Expected Value

P(X=x)

Note:

the digits of pi are considered pseudorandom.

although they follow the pattern of a uniform distribution, the order in which the digits appear doesn't change

Continuous

Discrete

Continuous Uniform Distribution

Normal Distribution

all outcomes have the same probability

bell curve

seen in randomly distributed natural phenomena

Exponential Distribution

decay curve (not a bell curve)

models the waiting time between randomly occurring events

Discrete Uniform Distribution

all outcomes have the same probability

dice roll, coin flip, digits of pi

Binomial Distribution

probability of getting a certain number of successes in a fixed number of Bernoulli trials

approximates normal distribution

how many heads in 15 coin flips?

Geometric Distribution

number of trials needed to get the first success in repeated Bernoulli trials

approximates exponential distribution

how many times do you have to roll a die before you get a 6

Hypergeometric Distribution

number of successes in a fixed number of draws without replacement (so the trials are not independent)

Bernoulli Trials

experiment where there are 2 outcomes: success and failure

probability of success = p

probability of failure = 1 - p

trials are independent

Back to the Quincunx

Questions:

1) Name 5 examples of Bernoulli Trials.

2) What is p for a coin flip?

3) What is 1-p for a coin flip?

4) What is the probability of getting all heads with 5 flips?

5) What is the probability of getting 3 heads with 5 flips?

6) What is the probability of getting 7 heads in 15 flips?
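Questions 4 to 6 can be checked with the standard binomial formula $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. The formula is not shown in the slide text, so it is supplied here as an assumption, and the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_probability(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p = Fraction(1, 2)                      # fair coin
print(binomial_probability(5, 5, p))    # all heads in 5 flips: 1/32
print(binomial_probability(3, 5, p))    # 3 heads in 5 flips: 5/16
print(binomial_probability(7, 15, p))   # 7 heads in 15 flips: 6435/32768
```

Using `Fraction` keeps the answers exact instead of rounding them to decimals.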

http://www.mathsisfun.com/data/quincunx.html

Some important distributions...

E(X) = np

$P(X = x) = (1-p)^x p$

Probability of k successes in r trials without replacement

Probability of x failures before a success

Probability of k successes in n trials
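A brief sketch of the geometric and hypergeometric cases. The geometric function counts x failures before the first success. The hypergeometric function uses the standard counting formula, which does not appear in the slide text and is supplied here as an assumption. Function and parameter names are illustrative.

```python
from fractions import Fraction
from math import comb

def geometric_probability(x, p):
    """P(X = x): exactly x failures before the first success."""
    return (1 - p) ** x * p

def hypergeometric_probability(k, r, successes, population):
    """P(k successes in r draws without replacement from a population
    that contains `successes` successful items)."""
    ways = comb(successes, k) * comb(population - successes, r - k)
    return Fraction(ways, comb(population, r))

# Two non-sixes before the first 6 when rolling a fair die.
print(geometric_probability(2, Fraction(1, 6)))        # 25/216

# Exactly 2 aces in a 5-card hand from a standard 52-card deck.
print(float(hypergeometric_probability(2, 5, 4, 52)))
```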