# Data Management Old

by

## Lukas Notten

on 6 September 2016


#### Transcript of Data Management Old

Data Management
Unit 4
Probability Distributions

Unit 3
Statistical Analysis

Unit 2 Organization of Data
MDM 4U
Mr. Notten

E. Culminating Investigation
Unit 1
Counting and Probability

1.0 Intro to Probability
The Monty Hall Problem
1.1 The Language of Probability
1.2
Visualizing Probability
A. Probability

B. Types of Probability

C. Probability Definitions

D. The Complement Property
The Law of Large Numbers
1. Empirical (Experimental)
2. Theoretical
3. Subjective
1. Sample space (continuous and discrete)
2. Trial
3. Outcome
Applications: insurance, the lottery, casinos, the stock market
Review Concepts
1. Algebra and solving equations
2. Creating graphs (histograms, pie charts, bar graphs etc...)
3. Prime/composite numbers
4. Perfect squares

http://www.iflscience.com/health-and-medicine/herd-immunity-and-measles-why-we-should-aim-100-vaccination-coverage
Herd Immunity and the Measles
Herd Immunity: if a certain percentage of people in a population is immunized, the chance of an outbreak is much lower
Rubik's Cubes
The Birthday Problem
As the number of trials gets very large, an experimental probability approaches the theoretical probability
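The law of large numbers can be checked with a quick simulation. This is a minimal sketch that assumes a fair coin (theoretical probability 0.5); the function name `heads_frequency` and the fixed seed are illustrative choices, not part of the course material.

```python
import random


def heads_frequency(num_flips, seed=0):
    """Return the experimental probability of heads in num_flips fair coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips


# The experimental probability should drift toward 0.5 as the trials increase.
for n in (10, 100, 10_000, 100_000):
    print(n, heads_frequency(n))
```

With only 10 flips the result can be far from 0.5; with 100,000 flips it should be very close.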
1.3
Conditional Probability
A. Independent events

B. Independent events vs. Mutually Exclusive events. What's the difference?
B. Mutually Exclusive (Disjoint), non-mutually exclusive events
A. Outcome tables, probability trees, Venn diagrams
1.4
Independence
1.5
Learning How to Count
Factorials
Permutations

QUIZ
The Gambler's Fallacy
Monty Hall Problem Explained Using Conditional Probability
UNIT TEST
Explanation:
The gambler's fallacy is the mistaken belief that, if something happens more frequently than normal during some period, it will happen less frequently in the future, or that, if something happens less frequently than normal during some period, it will happen more frequently in the future (presumably as a means of balancing nature).
Slot machines and casinos in general take advantage of this type of thinking.
http://en.wikipedia.org/wiki/Gambler%27s_fallacy
http://graphics.wsj.com/infectious-diseases-and-vaccines/#b02g20t20w15
Vaccination Stats:
1.7
Pascal's Triangle
Unit Project
Game Makers
The original (3×3×3) Rubik's Cube has eight corners and twelve edges.

There are 8! (40,320) ways to arrange the corner cubes. Seven can be oriented independently, and the orientation of the eighth depends on the preceding seven, giving 3^7 (2,187) possibilities.

There are 12!/2 (239,500,800) ways to arrange the edges, since an even permutation of the corners implies an even permutation of the edges as well. (When arrangements of centres are also permitted, as described below, the rule is that the combined arrangement of corners, edges, and centres must be an even permutation.)

Eleven edges can be flipped independently, with the flip of the twelfth depending on the preceding ones, giving 2^11 (2,048) possibilities.
http://en.wikipedia.org/wiki/Rubik%27s_Cube
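Multiplying the four counts above gives the total number of positions of the 3×3×3 cube. A short sketch of that arithmetic (the variable names are just labels for the quantities quoted above):

```python
from math import factorial

corner_arrangements = factorial(8)       # 8! = 40,320
corner_orientations = 3 ** 7             # 2,187
edge_arrangements = factorial(12) // 2   # 12!/2 = 239,500,800
edge_orientations = 2 ** 11              # 2,048

# Fundamental counting principle: multiply the independent choices.
total_positions = (corner_arrangements * corner_orientations
                   * edge_arrangements * edge_orientations)
print(f"{total_positions:,}")  # 43,252,003,274,489,856,000
```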
Permutations
4.0 That's So Random (?)
Terms:
probability, types (empirical, theoretical, subjective), complement, law of large numbers, mutually exclusive, non-mutually exclusive, conditional probability, independent events
Calculate probabilities using:

Venn diagrams, tree diagrams, outcome tables, the complement property, conditional probability, multiplicative rule for independent events
1.1-1.4 Quiz
Mutually Exclusive
Non-Mutually Exclusive
http://www.cbc.ca/news/lotteries-what-are-the-odds-1.775281
Winning the Lottery
Lottomax
Lotto 649
1.6
Does order matter?

Combinations
Odds
How many ways can you arrange a group of objects?
How many different groups can you select from a set of objects?
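The two questions above correspond to permutations (order matters) and combinations (order does not matter). A small sketch using Python's `math.perm` and `math.comb` (available from Python 3.8); the 5-object example is made up for illustration, and the last line assumes a lottery where 6 numbers are chosen from 49.

```python
from math import comb, perm

# Order matters: ways to arrange 3 of 5 objects in a row.
arrangements = perm(5, 3)    # 5 * 4 * 3 = 60

# Order does not matter: ways to select a group of 3 from 5 objects.
groups = comb(5, 3)          # 60 / 3! = 10

# Assumed lottery format: choose 6 numbers from 49, order ignored.
lotto_tickets = comb(49, 6)  # 13,983,816

print(arrangements, groups, f"{lotto_tickets:,}")
```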
You must design a game that is:
Interesting and original
Easy to play
Profitable (for the people running the game) in "Data Dollars"
Probability not obvious for the player!
You can use dice, spinners, coins, random number generators, cards and more (but you are responsible for bringing it in).

We will be having a "Game Fair" the Friday AFTER March Break where we will all get a chance to play each other's games. You will record all the outcomes of each trial for players playing your game.

You will also have to submit a written portion (due the following week) including:
Rules of the game (available to players at the Game Fair)
Theoretical probability of player winning
Player's expected return per game (probability of winning x winnings)
Summary of results from Game Fair
Comparison of actual and theoretical probabilities

6 numbers
3 letters followed by 3 numbers
3 numbers followed by 3 letters
4 letters followed by 3 numbers
3 numbers followed by a letter
Kenya
Ontario
France
2 letters followed by 3 numbers followed by 2 letters
http://garsia.math.yorku.ca/~zabrocki/math5020f03/lot649/lot649v3.pdf
Postal Codes
Phone Numbers
How many total possible 7-digit phone numbers are there?
How many total possible postal codes are there?
Example: M9R 1V8
How many possible license plates are there?
[17,576,000]
[10,000,000]
[456,976,000]
How to get there?
Example: 555-2445
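The bracketed totals above follow from the fundamental counting principle: multiply the number of choices for each position. This sketch assumes every letter (26) and every digit (10) is allowed in every position, which real phone, postal code and licence plate systems do not quite allow.

```python
LETTERS, DIGITS = 26, 10

# 7-digit phone number such as 555-2445
phone_numbers = DIGITS ** 7                   # 10,000,000

# Postal code such as M9R 1V8: letter-digit-letter digit-letter-digit
postal_codes = (LETTERS * DIGITS) ** 3        # 17,576,000

# Licence plates
plates_3_letters_3_numbers = LETTERS ** 3 * DIGITS ** 3   # 17,576,000
plates_4_letters_3_numbers = LETTERS ** 4 * DIGITS ** 3   # 456,976,000

for count in (phone_numbers, postal_codes,
              plates_3_letters_3_numbers, plates_4_letters_3_numbers):
    print(f"{count:,}")
```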
http://www.informationisbeautiful.net/

Remember that UPPERCASE letters are different from lowercase letters (for example, A is treated as different from a).
It must contain at least one character that is not a letter, such as a digit.
http://www.sussex.ac.uk/its/help/faq?faqid=839

Usernames can contain letters (a-z), numbers (0-9), dashes (-), underscores (_), apostrophes ('), and periods (.).
Usernames can't contain an equal sign (=), brackets (<,>), plus sign (+), or more than one period (.) in a row.
Passwords can contain any combination of ASCII characters and must contain a minimum of 8 characters.
First and last names support unicode/UTF-8 characters, with a maximum of 60 characters.
Periods (.) are not ignored as they are in a gmail.com account. If you create a user account called username, this user will not be able to receive messages addressed to user.name, or us.er.na.me, or any other combination of periods. To let a user receive mail with these variations, create an email alias for them.
When creating a password you have the following characters which you can use:

numbers (10 different ones: 0-9)
letters (52 different ones: A-Z and a-z)
special characters (32 different ones).
1. 6-character password with letters
2. 6-character password with letters and numbers
3. 8-character password with letters and numbers
4. 10-character password with letters and numbers
5. 10-character password with letters, numbers and special characters
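Each of the five cases above is a count of the form (number of allowed characters)^(password length). A sketch under the assumption that characters may repeat and that every string of the right length counts as a valid password (site rules such as "at least one non-letter" would reduce the totals); `password_count` is an illustrative helper name.

```python
NUMBERS, LETTERS, SPECIAL = 10, 52, 32


def password_count(length, alphabet_size):
    """Number of strings of the given length over an alphabet of the given size."""
    return alphabet_size ** length


cases = [
    ("6 characters, letters", 6, LETTERS),
    ("6 characters, letters and numbers", 6, LETTERS + NUMBERS),
    ("8 characters, letters and numbers", 8, LETTERS + NUMBERS),
    ("10 characters, letters and numbers", 10, LETTERS + NUMBERS),
    ("10 characters, letters, numbers and special", 10, LETTERS + NUMBERS + SPECIAL),
]

for label, length, size in cases:
    print(f"{label}: {password_count(length, size):,}")
```

Adding length increases the count much faster than adding a few extra allowed characters.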

Do these help?
http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen
An Introduction to Data
You will be getting a rubric tomorrow!
Gapminder TED Talk
How not to be ignorant about the world TED Talk
Information is Beautiful
http://www.thestar.com/yourtoronto/education/2015/03/05/boys-do-less-homework-than-girls-global-study-finds.html
2.1 Visualizing
Data
http://www.keepeek.com/Digital-Asset-Management/oecd/education/the-abc-of-gender-equality-in-education/tackling-underperformance-among-boys_9789264229945-5-en
Creating Data
Visualizations
1) Excel & Numbers
2) plot.ly
3) Gapminder World
discrete data
vs.
continuous data
qualitative data
vs.
quantitative data
primary data
vs.
secondary data
Girls do more homework than boys?
2.4 Reporting
on Data

2.5 Big Data
Toronto Star Article
Chapter 2 of Original OECD Study
The ABC of Gender Equality in Education: Tackling Underperformance Among Boys
According to this Toronto Star Article, yes they do...
2.1 Case Study
http://www.pewresearch.org/quiz/science-knowledge/
http://www.people-press.org/2013/04/22/publics-knowledge-of-science-and-technology/
What do you know about Science?
Pew Research Center Quiz
Pew Research Results
"Public's Knowledge of Science and Technology"
Below is a quiz written by Pew Research Center to measure scientific literacy in the USA. Try it yourself and see how you compare!
2.3 Case Study
Pew Research Center Survey Methodology
http://www.pewresearch.org/methodology/u-s-survey-research/
What is Fair?
Fair Games:
player has an equal chance of winning or losing or all players have the same probability of winning

Expected value:
how much a player would expect to walk away with, on average, after playing a probability game
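Expected value can be computed by weighting each net gain by its probability; a game is fair to the player when the result is 0. The dice game below is a hypothetical example (pay 1 Data Dollar, win on a six), not one from the course.

```python
from fractions import Fraction


def expected_value(outcomes):
    """outcomes: list of (probability, net_gain) pairs whose probabilities sum to 1."""
    assert sum(p for p, _ in outcomes) == 1
    return sum(p * gain for p, gain in outcomes)


# Hypothetical game: pay $1 to roll a die. A six returns a net gain of $4.
house_game = [(Fraction(1, 6), 4), (Fraction(5, 6), -1)]
# Same game with a net gain of $5 on a six.
fair_game = [(Fraction(1, 6), 5), (Fraction(5, 6), -1)]

print(expected_value(house_game))  # -1/6: the player loses about 17 cents per game
print(expected_value(fair_game))   # 0: fair game
```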
Case Study: 1972 Ford Pinto
2.3 Always go to the Source
2.2 What is Fair?
http://www.engineering.com/Library/ArticlesPage/tabid/85/ArticleID/166/Ford-Pinto.aspx
Utilitarianism: philosophy that states the best choice is the one that benefits the most people (maximizes utility)
Split-Bar Graphs
population
vs.
sample
census
vs.
survey
Some Sources of Data:
OECD
Pew Research
Sports leagues (NHL, NBA, MLB, MLS)
EQAO
Beware the Filter Bubble
http://blog.apastyle.org/apastyle/2013/12/how-to-cite-a-data-set-in-apa-style.html
Pew Hispanic Center. (2004). Changing channels and crisscrossing cultures:
A survey of Latinos on the news media [Data file and code book]. Retrieved
from http://pewhispanic.org/datasets/
The in-text citation would be "Pew Hispanic Center (2004)" or "(Pew Hispanic Center, 2004)."
Referencing Data in APA
format of data
1. What surprised you most about the article?

5. Take out the marking scheme for your 2.4 Assignment. How do you think this article would score on the assignment?

6. Name 5 things that may indicate this article may be using unreliable data or may be purposefully misrepresenting data.

7. Put a star (*) everywhere in the article you think there should be an in-text citation.

Discussion questions
More Discussion Questions
"Scientists Blow The Lid on Cancer & Sunscreen Myth"
http://theunboundedspirit.com/scientists-blow-the-lid-on-cancer-sunscreen-myth/
Correlation vs. Causation
Does correlation imply causation?
In your group, choose 2 quantitative variables you think have a correlation. You will need to collect the data on the students in the class and create a scatter plot with the data. Here are some examples of variables...
• Height
• Arm span
• Shoe size
• Foot length
• Hours spent watching TV
• Hours spent looking at a screen
• Number of siblings
• Number of electronic devices owned

• Snapchat score
• Number of tweets
• Number of phone contacts
• Hair length
• Number of courses taken this semester
• Number of math courses taken this year
Or come up with your own!
1) Choose your variables, identify the independent and dependent variable
2) Collect data from the class
3) Plot the data, draw a line of best fit, describe the correlation
4) Is this a causal relationship? [Does the independent variable affect the dependent variable?]
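Step 3 can be backed up with numbers: the correlation coefficient r describes the strength and direction of a linear correlation, and least squares gives a line of best fit. This is a sketch using the usual Pearson formula; the height and arm span values are made up for illustration, and both functions assume each variable actually varies (otherwise they divide by zero).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient: near +1 or -1 is strong, near 0 is weak."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def best_fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up class data: height (cm) as independent, arm span (cm) as dependent.
heights = [152, 160, 165, 171, 178, 183]
arm_spans = [150, 158, 166, 170, 180, 181]
print(pearson_r(heights, arm_spans))
print(best_fit_line(heights, arm_spans))
```

A value of r close to 1 still says nothing about step 4: a strong correlation does not by itself show a causal relationship.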
Instructions...
Terms:
causal relationship
common-cause factor
spurious correlation
Google Ngram Viewer - Appearance of Words in Books
Do cell phones cause cancer?
http://tylervigen.com/
http://www.eqao.com/emagazine/2008/05/eMagArticle.aspx?Lang=E&ArticleID=08&ItemID=23
piktochart.com
prezi
10 awesome sites
powerpoint
Creating infographics
1. Describe the correlation in the scatter plot above (linear, non-linear, strong, weak, positive, negative)

2. Do you think this is a causal relationship?
If so, explain why you think so.
If not, identify a common-cause factor

3. Is the data in this graph easy to understand or is it misleading? What would you change about it?
Respond to the following...
What are the limitations of 'small data' and how can we overcome them?
A) What is 'small data'?
B) This is why you shouldn’t believe that exciting new medical study
http://www.vox.com/2015/3/23/8264355/research-study-hype
http://www.informationisbeautiful.net/play/snake-oil-supplements/
Explore...
C) What is meta-analysis and why is it useful?
crime
education
social media
consumer goods
health
environment
D) What is big data and why is it useful?

Homework: Make a list of anything that is regularly collecting data on you (phone GPS, apps, iTunes, websites, retail locations).
Can produce false positive results or exaggerated results that lack accuracy
Not enough research. Small data = small numbers
A small amount of data may be insufficient to establish a clear trend
Small data may not be accurate because it doesn't generalize to a larger, broader population
Classification problems
Only one case study. Not many factors studied
data cannot always be allocated accurately to a given industry and therefore it becomes necessary to assign these data to an additional industry
Influences researchers cannot control
Environmental factors can alter results therefore creating inaccuracy
small number of data points (10, 100, 1000)
data reported in individual studies
analysis of a large quantity of studies identifying trends in the studies themselves
Topics
Part 2: Find correlation [individual]
use your data hunting skills to find correlation between data in the chosen issue
data must be from a reliable source
data must be approved by teacher
student will submit a scatter plot along with description of correlation
due Monday
Part 3: Compile data [topic group]
compile each team member's data into a common Google Spreadsheet so all members of the group can access and edit the data at any time
Part 4: Create an infographic [topic group]
the topic group will work together to create an infographic based on the data they have collected in parts 2 and 3
Part 1: The issue [topic group]
Create an infographic on a current issue
decide what topic most interests you and meet with the rest of the topic group
decide on a current issue that you will all search for data on
issue is to be approved by teacher
due Friday
2.5 Assignment
Some big data resources
IHME (Institute for Health Metrics and Evaluation)
http://www.healthdata.org/
Gapminder
http://www.gapminder.org/
crime rate in Etobicoke
effect of crime rate on education
new sex ed curriculum
EQAO and Literacy Test
effect of social media on school marks
compare different devices/apps
effect of screen time on school marks
obesity rates
alternative energy
electric cars
Fraser Institute
http://ontario.compareschoolrankings.org/SchoolsByRankLocationName.aspx?schoolType=secondary&SortBy=RankThisYear
Issues
Toronto Crime Rates
http://www.cbc.ca/toronto/features/crimemap/
More...
EQAO, Stats Canada, Pew Research etc.
Choosing a topic
Choosing variables
Finding data
http://www.makeuseof.com/tag/awesome-free-tools-infographics/
Complete and show examples
Sharing data
Sampling Methods
(2.3 in text)
Simple random
[equally likely to choose anybody]
Systematic random
[survey every nth person]
Stratified random
[divide population into groups and randomly sample the groups]
Cluster random
[divide population into groups and do a census of randomly selected groups]
Multi-stage random
[divide population into groups and do a simple random sample of randomly selected groups]
Destructive
[non-repeatable, samples require destruction of subject, not done on people]
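Four of the sampling methods above can be sketched with Python's `random` module. The school below (4 grades of 30 numbered students) is hypothetical, and the function names are illustrative only.

```python
import random


def simple_random(population, k, rng):
    """Everyone is equally likely to be chosen."""
    return rng.sample(population, k)


def systematic(population, step, rng):
    """Random starting point, then every nth person."""
    start = rng.randrange(step)
    return population[start::step]


def stratified(groups, fraction, rng):
    """Randomly sample the same fraction from every group."""
    return {name: rng.sample(members, max(1, round(len(members) * fraction)))
            for name, members in groups.items()}


def cluster(groups, num_groups, rng):
    """Census of a few randomly selected groups."""
    chosen = rng.sample(sorted(groups), num_groups)
    return {name: list(groups[name]) for name in chosen}


# Hypothetical school: grades 9-12 with 30 students each, numbered 0-119.
groups = {grade: list(range((grade - 9) * 30, (grade - 8) * 30))
          for grade in (9, 10, 11, 12)}
population = [student for members in groups.values() for student in members]

rng = random.Random(1)  # fixed seed so the run is repeatable
print(simple_random(population, 12, rng))
print(systematic(population, 10, rng))
print(stratified(groups, 0.1, rng))
print(list(cluster(groups, 2, rng)))
```

A multi-stage sample would combine the last two ideas: pick groups at random, then take a simple random sample inside each chosen group.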

Survey Questions
(2.4 in text)
Types:
Open Questions
Closed Questions
Information questions
Checklist questions
Ranking questions
Rating questions

Avoid...
1) avoid jargon
2) avoid abbreviations
3) negatives
5) insensitivity
Survey questions should be...
Bias in Surveys
(2.5 in text)
Sampling bias
[chosen sample does not represent population]
Non-response bias
[surveys not returned, skewing results]
Household bias
[overrepresentation of one group of respondents]
Response bias
[factors in the sampling method affecting the results]
Types of bias
Avoiding Bias
1) stratified random sampling
Discussion questions...
1. What type of sampling is used in this survey?
2. Is the data collected quantitative or qualitative? Discrete or continuous?
3. Identify the question type for each question. Would you have changed any of the types of questions?
4. Why do I ask for your name at the beginning of the survey? Is this a good idea?
5. Was the survey biased in any way? What types of bias were there and how would you fix it?
6. Why is it common for a survey to ask questions about race, age or gender?
Mr Notten's Survey
Non-random
[not everyone has an equal chance of being selected, or the group is handpicked]
Census
[everyone in a population is sampled]
Why do we usually try to do random sampling?
What is the issue with non-random sampling?
Do you think most surveys done are random or non-random?
Your survey questions will be evaluated on these criteria.
How can we avoid bias when we do a survey?
Creating a survey...
Tools:
Survey Monkey (surveymonkey.com)
Any other online survey platform
Paper survey
[effective if you want to target a specific group you can physically meet with]
now it's time for you to collect some of your own primary data!
Big Data and Social Justice
http://news.nationalpost.com/toronto/torontos-new-police-chief-wont-abolish-controversial-practice-of-carding-there-will-be-an-increase-in-crime
Carding in Toronto
3.1 Central Tendency
A. Mean, Median and Mode
B. Central Tendency
How can we describe a set of data based on its distribution?
LEFT SKEW
RIGHT SKEW
Weighted Mean
Uniform
U-Shaped
Skewed
Mound-Shaped
B. Standard Deviation
3.4 The Normal Distribution
C. Characteristics of Normal Distributions
B. Where can we find normal distributions?
3.5 Applying Normal Distributions
A. Z-Scores
B. Percentiles

How can we use normal distributions to understand data?
Unit 3 Quiz
4.2 Binomial Distributions
The Digits of Pi
4.4 Hypergeometric Distributions
4.3 Geometric Distributions
3.2 Creating Histograms
A. Creating Histograms
B. Interpreting Histograms

C. Tools for creating histograms
How can we create and interpret histograms?
When are mean, median and mode useful?
Steps:
choose a quantitative, continuous variable
collect data
determine range of data (highest - lowest)
choose appropriate number of intervals (so bin width is easy to work with)
calculate a bin width (range/number of intervals)
make sure no values lie between intervals
count frequency for each bin
scale y-axis and plot data
label the mean, median and mode on the histogram
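The range, bin width and frequency steps above can be sketched as follows. The heights are made-up sample data and `histogram_bins` is an illustrative name; each bin includes its lower edge, with the maximum value placed in the last bin so no value falls between intervals. The sketch assumes the data are not all identical.

```python
def histogram_bins(data, num_intervals):
    """Return (bin edges, frequency per bin) for equal-width bins."""
    low, high = min(data), max(data)
    data_range = high - low                 # highest - lowest
    bin_width = data_range / num_intervals  # range / number of intervals
    counts = [0] * num_intervals
    for value in data:
        # The maximum value would land one bin too far, so clamp it.
        index = min(int((value - low) / bin_width), num_intervals - 1)
        counts[index] += 1
    edges = [low + i * bin_width for i in range(num_intervals + 1)]
    return edges, counts


# Made-up heights in cm
heights = [150, 155, 160, 162, 165, 168, 170, 172, 175, 180]
edges, counts = histogram_bins(heights, 3)
print(edges)   # [150.0, 160.0, 170.0, 180.0]
print(counts)  # [2, 4, 4]
```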
https://plot.ly/~cimar/214/_2013-nhl-player-height/
Plotly
using frequency()
Collect some data from the class and make a histogram
Why is it important to look at the spread of data?
vs.
Remember, histograms are used for quantitative, continuous variables
height
shoe length
hair length
finger span
arm span
money in wallet
oldest coin in pocket/wallet
Variance
Find the IQR by:
1) Order the data and find the median (Q2)
2) Find the median of each half (Q1 and Q3)
3) IQR = Q3-Q1
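The three IQR steps can be sketched in a few lines. One assumption to flag: when the number of values is odd, this version leaves the median out of both halves, which is a common convention but not the only one.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-each-half method."""
    ordered = sorted(data)                  # step 1: order the data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value (assumed convention).
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)  # step 2


q1, q2, q3 = quartiles([7, 1, 13, 5, 9, 3, 11])
print(q1, q2, q3)        # 3 7 11
print("IQR =", q3 - q1)  # step 3: IQR = 8
```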
A. Interquartile Range

Homework: Pg 168 #1-6, do 7-8 with technology!
a measure of spread based around the median
commonly shown in box and whisker plots
Box and Whisker Plot
and Variance
a measure of spread around the mean
the higher the standard deviation, the further away from the mean the data is spread
Calculate the standard deviation of a set of data by hand (easiest done in a table):
1) Calculate the mean of the data
2) Calculate the difference between each value and the mean
3) Calculate the square of the difference for each value
4) Sum the squares
5) Plug the data into the standard deviation formula
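The five steps map directly onto code. The slides do not say which formula is meant, so this sketch assumes the population formula (divide by n); a sample standard deviation would divide by n - 1 instead.

```python
from math import sqrt


def standard_deviation(data):
    """Population standard deviation, following the by-hand table steps."""
    mean = sum(data) / len(data)             # 1) mean
    differences = [x - mean for x in data]   # 2) difference from the mean
    squares = [d ** 2 for d in differences]  # 3) square each difference
    total = sum(squares)                     # 4) sum the squares
    return sqrt(total / len(data))           # 5) formula (population version)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_deviation(data))       # 2.0
print(standard_deviation(data) ** 2)  # variance = 4.0
```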
http://www.mathsisfun.com/data/quincunx.html
the square of the standard deviation
we will use standard deviation more in this unit
Calculating Standard Deviation Using Technology
gender
sex
biological
psychological
independent of sexual orientation
(LGBT)
vs
A. What is the normal distribution?
modeled by:
found in a lot of seemingly random real world observations
we represent patterns in random events using probability distributions: the normal distribution is one example
used as an approximation in statistical data
this isn't always a good assumption
a probability distribution (connect the tops of bars in a histogram) that forms a bell shape
useful because it is relatively easy to model
notation used in textbook: X ~ N(x̄, σ²), where x̄ is the mean and σ² is the variance
example: X ~ N(5, 2.3²)
One person is a flipper, one person is a counter
1) Flip a coin 15 times
2) Record how many times heads came up
3) Repeat steps 1 and 2 three more times
4) Tell the teacher how many times heads came up in each set of 15 flips
In pairs...
symmetrical (mean, median and mode all the same)
bell shaped, approaching 0 at the extremes
68% of data within 1 SD
95% of data within 2 SD
99.7% of data within 3 SD
y-axis usually represents the probability
total area under the curve = 1
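The 68%, 95% and 99.7% figures can be checked with `statistics.NormalDist` from the Python standard library (3.8+). This sketch uses the standard normal distribution (mean 0, SD 1); `share_within` is an illustrative helper name.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of the area under the curve within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```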
4.1 Probability Distributions
Random variable = X
variable subject to chance
Discrete Random variable = X
variable that assumes a unique value for each outcome
Probability Distribution of a Discrete Random Variable
Probability Distribution of a Continuous Random Variable
Calculate the probability of each outcome for the sum when rolling a pair of dice.
Is the sum a discrete or continuous variable?
What is the expected value?
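One way to work through the dice questions above is to list all 36 equally likely outcomes and count each sum. A sketch assuming two fair six-sided dice:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely rolls give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability distribution of the discrete random variable X = sum.
distribution = {total: Fraction(ways, 36) for total, ways in sorted(counts.items())}

# Expected value: each outcome weighted by its probability.
expected_value = sum(total * p for total, p in distribution.items())

for total, p in distribution.items():
    print(total, p)
print("E(X) =", expected_value)  # E(X) = 7
```

The sum can only take the whole-number values 2 to 12, so it is a discrete variable.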
Calculating Expected Value
P(X=x)
Note:
the digits of pi are considered pseudorandom.
although they follow the pattern of a uniform distribution, the order in which the digits appear doesn't change
Continuous
Discrete
Continuous Uniform Distribution
all outcomes have the same probability
Normal Distribution
bell curve
seen in randomly distributed natural phenomena
Exponential Distribution
decays steadily from its peak at zero (not a bell curve)
models waiting times between random events
Discrete Uniform Distribution
all outcomes have the same probability
dice roll, coin flip, digits of pi
Binomial Distribution
probability of getting a certain number of successes in a fixed number of Bernoulli trials
approximates normal distribution
how many heads in 15 coin flips?
Geometric Distribution
number of trials needed to get a success in repeated Bernoulli trials
approximates exponential distribution
how many times do you have to roll a die before you get a 6
Hypergeometric Distribution
number of successes in a series of success/failure trials made without replacement
Bernoulli Trials
experiment where there are 2 outcomes: success and failure
probability of success = p
probability of failure = 1 - p
trials are independent
Back to the Quincunx
Questions:
1) Name 5 examples of Bernoulli Trials
2) What is p for a coin flip
3) What is 1-p for a coin flip
4) What is the probability of getting all heads with 5 flips
5) What is the probability of getting 3 heads with 5 flips?
6) What is the probability of getting 7 heads in 15 flips?
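Questions 4 to 6 can be checked with the binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). A sketch assuming a fair coin (p = 0.5); `binomial_pmf` is an illustrative name.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


print(binomial_pmf(5, 5, 0.5))   # all heads in 5 flips: 0.03125
print(binomial_pmf(3, 5, 0.5))   # 3 heads in 5 flips: 0.3125
print(binomial_pmf(7, 15, 0.5))  # 7 heads in 15 flips: about 0.196
```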
http://www.mathsisfun.com/data/quincunx.html
Some important distributions...
E(X) = np
P(X = x) = (1 - p)^x p
Probability of k successes in r trials with removal
Probability of x failures before a success
Probability of k successes in n trials
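The geometric and hypergeometric descriptions above can be sketched as small functions. The geometric version uses the formula (1 - p)^x p for x failures before the first success. The hypergeometric version uses the standard counting formula with combinations; its parameter names are illustrative, not the textbook's notation.

```python
from fractions import Fraction
from math import comb


def geometric_pmf(x, p):
    """Probability of exactly x failures before the first success."""
    return (1 - p) ** x * p


def hypergeometric_pmf(k, r, successes_in_population, population_size):
    """Probability of k successes in r draws made without replacement."""
    failures_in_population = population_size - successes_in_population
    return Fraction(
        comb(successes_in_population, k) * comb(failures_in_population, r - k),
        comb(population_size, r),
    )


# Rolling a die until a 6 appears: chance of exactly 2 misses first.
print(geometric_pmf(2, Fraction(1, 6)))  # 25/216

# Drawing 2 cards from a 52-card deck: chance that both are aces.
print(hypergeometric_pmf(2, 2, 4, 52))   # 1/221
```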