Data Management Old
Transcript of Data Management Old
Unit 2 Organization of Data
E. Culminating Investigation
Counting and Probability
1.0 Intro to Probability
The Monty Hall Problem
1.1 The Language of Probability
B. Types of Probability
C. Probability Definitions
D. The Complement Property
The Law of Large Numbers
1. Empirical (Experimental)
1. Sample space (continuous and discrete)
Applications: insurance, the lottery, casinos, the stock market
1. Algebra and solving equations
2. Creating graphs (histograms, pie charts, bar graphs etc...)
3. Prime/composite numbers
4. Perfect squares
Herd Immunity and the Measles
Herd Immunity: if a certain percentage of people in a population is immunized, the chance of an outbreak is much lower
The Birthday Problem
As the number of trials gets very large, an experimental probability approaches the theoretical probability
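The law of large numbers can be watched in action with a short simulation. This is only a sketch: the function name `experimental_probability`, the fixed seed and the list of trial counts are choices made for this example, not part of the lesson.

```python
import random

def experimental_probability(num_flips, seed=0):
    """Fraction of heads in num_flips simulated fair coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The theoretical probability of heads is 0.5.
for n in (10, 100, 10_000, 100_000):
    print(n, experimental_probability(n))
```

With only a few flips the experimental probability can land far from 0.5; with many flips it should settle close to it.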
A. Independent events
B. Independent events vs. Mutually Exclusive events. What's the difference?
B. Mutually Exclusive (Disjoint), non-mutually exclusive events
A. Outcome tables, probability trees, Venn diagrams
Learning How to Count
The Gambler's Fallacy
Monty Hall Problem Explained Using Conditional Probability
The gambler's fallacy is the mistaken belief that, if something happens more frequently than normal during some period, it will happen less frequently in the future,
or that, if something happens less frequently than normal during some period, it will happen more frequently in the future
(presumably as a means of balancing nature).
Slot machines and casinos in general take advantage of this type of thinking.
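One way to test the gambler's fallacy is to simulate a fair coin and look only at the flips that come right after a run of heads. The sketch below assumes a fair coin and a streak length of 3; the function name and the seed are made up for this example.

```python
import random

def heads_rate_after_streak(num_flips=200_000, streak=3, seed=1):
    """Share of heads among flips that follow `streak` heads in a row."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(num_flips)]  # True = heads
    followers = [
        flips[i]
        for i in range(streak, num_flips)
        if all(flips[i - streak:i])  # the previous `streak` flips were all heads
    ]
    return sum(followers) / len(followers)

print(heads_rate_after_streak())
```

If the coin "owed" the player a tail, the result would be well below 0.5. Because the flips are independent, it should stay near 0.5.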
The original (3×3×3) Rubik's Cube has eight corners and twelve edges.
There are 8! (40,320) ways to arrange the corner cubes. Seven can be oriented independently, and the orientation of the eighth depends on the preceding seven, giving 3^7 (2,187) possibilities.
There are 12!/2 (239,500,800) ways to arrange the edges, since an even permutation of the corners implies an even permutation of the edges as well. (When arrangements of centres are also permitted, as described below, the rule is that the combined arrangement of corners, edges, and centres must be an even permutation.)
Eleven edges can be flipped independently, with the flip of the twelfth depending on the preceding ones, giving 2^11 (2,048) possibilities.
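Multiplying the four counts above gives the total number of positions of the cube. A short check of the arithmetic (the variable names are chosen for this example):

```python
from math import factorial

corner_positions = factorial(8)       # 8!    = 40,320
corner_orientations = 3 ** 7          # 3^7   = 2,187
edge_positions = factorial(12) // 2   # 12!/2 = 239,500,800
edge_orientations = 2 ** 11           # 2^11  = 2,048

total_positions = (corner_positions * corner_orientations
                   * edge_positions * edge_orientations)
print(f"{total_positions:,}")  # 43,252,003,274,489,856,000
```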
4.0 That's So Random (?)
probability, types (empirical, theoretical, subjective), complement, law of large numbers, mutually exclusive, non-mutually exclusive, conditional probability, independent events
Calculate probabilities using:
Venn diagrams, tree diagrams, outcome tables, the complement property, conditional probability, multiplicative rule for independent events
Winning the Lottery
Does order matter?
How many ways can you arrange a group of objects?
How many different groups can you select from a set of objects?
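The two questions above correspond to permutations (order matters) and combinations (order does not matter). A small sketch using Python's `math.perm` and `math.comb`; the "5 objects, choose 3" numbers and the 6-of-49 lottery are assumed examples, not taken from the slides.

```python
from math import comb, perm

# Arrange 3 objects chosen from 5, where order matters: 5 * 4 * 3
arrangements = perm(5, 3)
print(arrangements)  # 60

# Select a group of 3 from 5, where order does not matter
groups = comb(5, 3)
print(groups)  # 10

# Hypothetical lottery: pick 6 numbers from 49, order does not matter
lottery_tickets = comb(49, 6)
print(lottery_tickets)  # 13983816
```

Each group of 3 can be arranged in 3! = 6 ways, which is why 60 / 6 = 10.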
You must design a game that is:
Interesting and original
Easy to play
Profitable (for the people running the game) in "Data Dollars"
Probability not obvious for the player!
You can use dice, spinners, coins, random number generators, cards and more (but you are responsible for bringing it in).
We will be having a "Game Fair" the Friday AFTER March Break where we will all get a chance to play each other's games. You will record all the outcomes of each trial for players playing your game.
You will also have to submit a written portion (due the following week) including:
Rules of the game (available to players at the Game Fair)
Theoretical probability of player winning
Player's expected return per game (probability of winning x winnings)
Summary of results from Game Fair
Comparison of actual and theoretical probabilities
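As a sketch of the written portion, here is how the theoretical probability and expected return could be worked out for one made-up game. Everything about the game is hypothetical: the player pays 1 Data Dollar, rolls two dice, and wins 4 Data Dollars if the sum is 7.

```python
from fractions import Fraction
from itertools import product

COST_TO_PLAY = 1  # hypothetical price in Data Dollars
PRIZE = 4         # hypothetical payout when the player wins

rolls = list(product(range(1, 7), repeat=2))      # all 36 outcomes of two dice
winning_rolls = [roll for roll in rolls if sum(roll) == 7]

p_win = Fraction(len(winning_rolls), len(rolls))  # theoretical probability
expected_return = p_win * PRIZE                   # probability of winning x winnings
house_profit = COST_TO_PLAY - expected_return     # average profit per game

print(p_win, expected_return, house_profit)  # 1/6 2/3 1/3
```

The game is profitable for the people running it whenever the expected return is less than the cost to play.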
3 letters followed by 3 numbers
3 numbers followed by 3 letters
4 letters followed by 3 numbers
3 numbers followed by a letter
2 letters followed by 3 numbers
followed by 2 letters
How many total possible 7-digit phone numbers are there?
How many total possible postal codes are there?
Example: M9R 1V8
How many possible license plates are there?
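These counting questions all use the same idea: multiply the number of choices for each position. The sketch below assumes every letter (26) and every digit (10) is allowed in every position and that repeats are allowed; real phone numbers, postal codes and license plates have extra restrictions. The pattern strings and the function name are invented for this example.

```python
LETTERS = 26
DIGITS = 10

def count_codes(pattern):
    """Count codes for a pattern such as 'LLLDDD' (L = letter, D = digit)."""
    total = 1
    for symbol in pattern:
        if symbol == "L":
            total *= LETTERS
        elif symbol == "D":
            total *= DIGITS
        else:
            raise ValueError(f"unknown symbol: {symbol!r}")
    return total

print(count_codes("DDDDDDD"))  # 7-digit phone number: 10000000
print(count_codes("LDLDLD"))   # postal code like M9R 1V8: 17576000
print(count_codes("LLLDDD"))   # 3 letters followed by 3 numbers: 17576000
print(count_codes("LLLLDDD"))  # 4 letters followed by 3 numbers: 456976000
```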
Other license plates
How to get there?
How Secure is My Password?
There are some simple rules that you must follow when changing your password:
Your password must be exactly 8 characters long.
It must start with a letter.
Remember that UPPERCASE letters are different from lowercase letters (for example, A is treated as different from a).
It must contain at least one character that is not a letter, such as a digit.
How many possible passwords?
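One possible approach to this question is to count every password that starts with a letter and then subtract the ones made only of letters. The rules do not say which characters are allowed, so the sketch assumes 52 letters and 42 non-letters (10 digits plus 32 special characters); the function name is made up.

```python
def count_valid_passwords(length=8, letters=52, non_letters=42):
    """Passwords that start with a letter and contain at least one non-letter."""
    all_characters = letters + non_letters
    rest = length - 1  # positions after the first character
    starts_with_letter = letters * all_characters ** rest
    letters_only = letters * letters ** rest
    return starts_with_letter - letters_only

print(f"{count_valid_passwords():,}")
```

Since the first character must be a letter, the required non-letter has to appear somewhere in the remaining 7 positions.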
When choosing usernames and passwords for users and groups you add to your Google for Work account, consider the following:
Usernames can contain letters (a-z), numbers (0-9), dashes (-), underscores (_), apostrophes ('), and periods (.).
Usernames can't contain an equal sign (=), brackets (<,>), plus sign (+), or more than one period (.) in a row.
Passwords can contain any combination of ASCII characters and must contain a minimum of 8 characters.
First and last names support unicode/UTF-8 characters, with a maximum of 60 characters.
Periods (.) are not ignored as they are in a gmail.com account. If you create a user account called username, this user will not be able to receive messages addressed to user.name, or us.er.na.me, or any other combination of periods. To let a user receive mail with these variations, create an email alias for them.
When creating a password you have the following characters which you can use:
numbers (10 different ones: 0-9)
letters (52 different ones: A-Z and a-z)
special characters (32 different ones).
1. 6-character password with letters
2. 6-character password with letters and numbers
3. 8-character password with letters and numbers
4. 10-character password with letters and numbers
5. 10-character password with letters, numbers and special characters
Do these help?
How many possible passwords?
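Each of the five cases can be counted as (number of available characters) raised to the password length. This sketch uses the character counts listed above (52 letters, 10 numbers, 32 special characters) and assumes any character may appear in any position; the names are chosen for this example.

```python
LETTERS, NUMBERS, SPECIAL = 52, 10, 32

def password_count(length, alphabet_size):
    """Number of passwords of a given length over an alphabet of a given size."""
    return alphabet_size ** length

cases = [
    ("6 characters, letters", 6, LETTERS),
    ("6 characters, letters and numbers", 6, LETTERS + NUMBERS),
    ("8 characters, letters and numbers", 8, LETTERS + NUMBERS),
    ("10 characters, letters and numbers", 10, LETTERS + NUMBERS),
    ("10 characters, letters, numbers and special", 10, LETTERS + NUMBERS + SPECIAL),
]

for label, length, size in cases:
    print(f"{label}: {password_count(length, size):,}")
```

Adding length grows the count much faster than adding a few more kinds of characters.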
An Introduction to Data
You will be getting a rubric tomorrow!
Gapminder TED Talk
How not to be ignorant about the world TED Talk
Information is Beautiful
1) Excel & Numbers
3) Gapminder World
Girls do more homework than boys?
2.5 Big Data
2.6 Your Data
Toronto Star Article
Chapter 2 of Original OECD Study
The ABC of Gender Equality in Education: Tackling Underperformance Among Boys
According to this Toronto Star Article, yes they do...
2.1 Case Study
What do you know about Science?
Pew Research Center Quiz
Pew Research Results
"Public's Knowledge of Science and Technology"
Below is a quiz written by Pew Research Center to measure scientific literacy in the USA. Try it yourself and see how you compare!
2.3 Case Study
Pew Research Center Survey Methodology
What is Fair?
player has an equal chance of winning or losing or all players have the same probability of winning
how much a player would expect to walk away with after playing a probability game
Case Study: 1972 Ford Pinto
and the business of probability
2.3 Always go to the Source
2.2 What is Fair?
Utilitarianism: philosophy that states the best choice is the one that benefits the most people (maximizes utility)
Some Sources of Data:
Sports leagues (NHL, NBA, MLB, MLS)
Beware the Filter Bubble
Pew Hispanic Center. (2004). Changing channels and crisscrossing cultures:
A survey of Latinos on the news media [Data file and code book]. Retrieved
The in-text citation would be "Pew Hispanic Center (2004)" or "(Pew Hispanic Center, 2004)."
Referencing Data in APA
format of data
1. What surprised you most about the article?
2. Was the headline successful in catching your attention?
3. Has reading this article changed the way you think about sunbathing? Why or why not?
4. Why do you think Mr. Notten chose this article for you to read?
5. Take out the marking scheme for your 2.4 Assignment. How do you think this article would score on the assignment?
6. Name 5 things that may indicate this article may be using unreliable data or may be purposefully misrepresenting data.
7. Put a star (*) everywhere in the article you think there should be an in-text citation.
8. What can we learn from this article about reporting on data?
More Discussion Questions
"Scientists Blow The Lid on Cancer & Sunscreen Myth"
Correlation vs. Causation
Does correlation imply causation?
In your group, choose 2 quantitative variables you think have a correlation. You will need to collect the data on the students in the class and create a scatter plot with the data. Here are some examples of variables...
• Arm span
• Shoe size
• Foot length
• Hours spent watching TV
• Hours spent looking at a screen
• Number of siblings
• Number of electronic devices owned
• Number of Facebook friends
• Snapchat score
• Number of tweets
• Number of Twitter followers
• Number of phone contacts
• Hair length
• Number of courses taken this semester
• Number of math courses taken this year
Or come up with your own!
1) Choose your variables, identify the independent and dependent variable
2) Collect data from the class
3) Plot the data, draw a line of best fit, describe the correlation
4) Is this a causal relationship? [Does the independent variable affect the dependent variable?]
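Step 3 can also be done with a short calculation instead of by eye. The sketch below computes a least-squares line of best fit and the correlation coefficient r. The function name and the data (foot length in cm against shoe size) are invented for illustration and are not class data.

```python
from math import sqrt

def best_fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # correlation coefficient, between -1 and 1
    return slope, intercept, r

# Invented example data: independent variable first, dependent variable second
foot_length_cm = [22.0, 23.5, 24.0, 25.5, 26.0, 27.5]
shoe_size = [5.0, 6.5, 7.0, 9.0, 9.5, 11.0]

slope, intercept, r = best_fit(foot_length_cm, shoe_size)
print(round(slope, 2), round(intercept, 2), round(r, 3))
```

A value of r close to 1 or -1 describes a strong linear correlation. It does not answer step 4: a strong correlation by itself does not show a causal relationship.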
Google Trends - Search Frequency
Google Ngram Viewer - Appearance of Words in Books
The Big Data of Google
Do cell phones cause cancer?
10 awesome sites
1. Describe the correlation in the scatter plot above (linear, non-linear, strong, weak, positive, negative)
2. Do you think this is a causal relationship?
If so, explain why you think so.
If not, identify a common-cause factor
3. Is the data in this graph easy to understand or is it misleading? What would you change about it?
Respond to the following...
What are the limitations of 'small data' and how can we overcome them?
A) What is 'small data'?
B) This is why you shouldn’t believe that exciting new medical study
C) What is meta-analysis and why is it useful?
D) What is big data and why is it useful?
Google search, maps, image search
Facebook, YouTube, Twitter, Instagram
Homework: Make a list of anything that is regularly collecting data on you (phone GPS, apps, iTunes, websites, retail locations).
Can produce false positives or exaggerated results that lack accuracy
Not enough research. Small data = small numbers
Small amount of data entry may be insufficient to obtain a concrete trend
Small data may not be accurate because it doesn't generalize to a larger population. The population is broad
Only one case study. Not many factors studied
data cannot be always allocated accurately to a given industry and therefore it becomes necessary to assign these data to an additional industry
Influences researchers cannot control
Environmental factors can alter results therefore creating inaccuracy
small number of data points (10, 100, 1000)
data reported in individual studies
analysis of a large quantity of studies identifying trends in the studies themselves
Part 2: Find correlation [individual]
use your data hunting skills to find correlation between data in the chosen issue
data must be from a reliable source
data must be approved by teacher
student will submit a scatter plot along with description of correlation
Part 3: Compile data [topic group]
compile each team member's data into a common Google Spreadsheet so all members of the group can access and edit the data at any time
Part 4: Create an infographic [topic group]
the topic group will work together to create an infographic based on the data they have collected in parts 2 and 3
Part 1: The issue [topic group]
Create an infographic on a current issue
decide what topic most interests you and meet with the rest of the topic group
decide on a current issue that you will all search for data on
issue is to be approved by teacher
Some big data resources
IHME (Institute for Health Metrics and Evaluation)
crime rate in Etobicoke
effect of crime rate on education
new sex ed curriculum
streaming in grade 9 (applied and academic)
EQAO and Literacy Test
effect of social media on school marks
compare different devices/apps
effect of screen time on school marks
Toronto Crime Rates
EQAO, Stats Canada, Pew Research etc.
Google Ngram Viewer
Choosing a topic
Complete and show examples
(2.3 in text)
[equally likely to choose anybody]
[survey every nth person]
[divide population into groups and randomly sample the groups]
[divide population into groups and do a census of randomly selected groups]
[divide population into groups and do a simple random sample of randomly selected groups]
[non-repeatable, samples require destruction of subject, not done on people]
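Three of the sampling methods described above can be sketched with Python's `random` module. The function names, the class list of 100 student numbers and the two grade groups are all made up for this example.

```python
import random

def simple_random_sample(population, k, rng):
    """Everyone is equally likely to be chosen."""
    return rng.sample(population, k)

def systematic_sample(population, step, rng):
    """Survey every nth person, starting from a random position."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(groups, fraction, rng):
    """Randomly sample the same fraction from each group."""
    sample = []
    for members in groups.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)              # fixed seed so the run is repeatable
students = list(range(1, 101))      # hypothetical student numbers 1 to 100
grades = {"grade 11": students[:60], "grade 12": students[60:]}

print(simple_random_sample(students, 10, rng))
print(systematic_sample(students, 10, rng))
print(stratified_sample(grades, 0.1, rng))
```

The stratified sample keeps the 60/40 split of the two groups, which a simple random sample of 10 might not.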
(2.4 in text)
simple, relevant, specific and readable.
1) avoid jargon
2) avoid abbreviations
4) leading questions
Survey questions should be...
Bias in Surveys
(2.5 in text)
[chosen sample does not represent population]
[surveys not returned, skewing results]
[overrepresentation of one group of respondents]
[factors in the sampling method affecting the results]
Types of bias
1) stratified random sampling
1. What type of sampling is used in this survey?
2. Is the data collected quantitative or qualitative? Discrete or continuous?
3. Identify the question type for each question. Would you have changed any of the types of questions?
4. Why do I ask for your name at the beginning of the survey? Is this a good idea?
5. Was the survey biased in any way? What types of bias were there and how would you fix it?
6. Why is it common for a survey to ask questions about race, age or gender?
Mr Notten's Survey
[not everyone has an equal chance of being selected, or the group is handpicked]
[everyone in a population is sampled]
Why do we usually try to do random sampling?
What is the issue with non-random sampling?
Do you think most surveys done are random or non-random?
Your survey questions will be evaluated on these criteria.
How can we avoid bias when we do a survey?
Creating a survey...
Google Forms (part of Google Docs)
[requires free Google login]
Survey Monkey (surveymonkey.com)
[requires you make a free account]
Any other online survey platform
Links to surveys can be shared on Twitter, Facebook etc...
[effective if you want to target a specific group you can physically meet with]
now it's time for you to collect some of your own primary data!
Big Data and Social Justice
Carding in Toronto
3.1 Central Tendency
A. Mean, Median and Mode
B. Central Tendency
How can we describe a set of data based on its distribution?
3.3 Measures of Spread
B. Standard Deviation
3.4 The Normal Distribution
C. Characteristics of Normal Distributions
B. Where can we find normal distributions?
3.5 Applying Normal Distributions
How can we use normal distributions to understand data?
Unit 3 Quiz
4.2 Binomial Distributions
The Digits of Pi
4.4 Hypergeometric Distributions
4.3 Geometric Distributions
3.2 Creating Histograms
A. Creating Histograms
B. Interpreting Histograms
C. Tools for creating histograms
How can we create and interpret histograms?
When are mean, median and mode useful?
choose a quantitative, continuous variable
determine range of data (highest - lowest)
choose appropriate number of intervals (so bin width is easy to work with)
calculate a bin width (range/number of intervals)
make sure no values lie between intervals
count frequency for each bin
scale y-axis and plot data
label the mean, median and mode on the histogram
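The counting steps above (range, bin width, frequency for each bin) can be sketched in a few lines. This is one possible approach: it assumes the data has at least two different values, puts a value that sits exactly on a boundary into the higher bin, and puts the maximum into the last bin. The function name and the data are invented.

```python
def histogram_counts(data, num_bins):
    """Return (bin_width, counts) using equal-width bins from min to max."""
    lowest, highest = min(data), max(data)
    bin_width = (highest - lowest) / num_bins  # range / number of intervals
    counts = [0] * num_bins
    for value in data:
        index = int((value - lowest) / bin_width)
        index = min(index, num_bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return bin_width, counts

# Invented data, e.g. money in wallet (dollars)
money = [1, 2, 2, 3, 5, 8, 9, 10]
width, frequencies = histogram_counts(money, 3)
print(width, frequencies)  # 3.0 [4, 1, 3]
```

Because every value falls in exactly one bin, no value lies between intervals.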
Collect some data from the class and make a histogram
Why is it important to look at the spread of data?
Remember, histograms are used for quantitative, continuous variables
money in wallet
oldest coin in pocket/wallet
Find the IQR by:
1) Order the data and find the median (Q2)
2) Find the median of each half (Q1 and Q3)
3) IQR = Q3-Q1
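The three steps can be followed directly in code. Conventions for quartiles differ between textbooks and software; this sketch assumes the median is left out of both halves when the number of values is odd. The function names and the data are made up.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-each-half method."""
    ordered = sorted(data)                 # step 1: order the data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it is in neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)  # step 2

def interquartile_range(data):
    q1, _, q3 = quartiles(data)
    return q3 - q1                         # step 3: IQR = Q3 - Q1

print(quartiles([1, 3, 5, 7, 9, 11, 13]))            # (3, 7, 11)
print(interquartile_range([1, 3, 5, 7, 9, 11, 13]))  # 8
```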
A. Interquartile Range
Task: calculate the IQR, standard deviation and variance for your graph
Homework: Pg 168 #1-6, do 7-8 with technology!
a measure of spread based around the median
commonly shown in box and whisker plots
Box and Whisker Plot
a measure of spread around the mean
the higher the standard deviation, the further away from the mean the data is spread
Calculate the standard deviation of a set of data by hand (easiest done in a table):
1) Calculate the mean of the data
2) Calculate the difference between each value and the mean
3) Calculate the square of the difference for each value
4) Sum the squares
5) Plug the data into the standard deviation formula
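The five steps translate line by line into code. The slides do not say which formula is used, so this sketch assumes the population formula (divide by the number of values); the sample formula divides by one less than the number of values. The function name and the data are invented.

```python
from math import sqrt

def standard_deviation(data):
    """Population standard deviation, following the by-hand steps."""
    n = len(data)
    mean = sum(data) / n                      # 1) mean of the data
    differences = [x - mean for x in data]    # 2) difference from the mean
    squares = [d ** 2 for d in differences]   # 3) square each difference
    total = sum(squares)                      # 4) sum the squares
    return sqrt(total / n)                    # 5) standard deviation formula

values = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_deviation(values))       # 2.0
print(standard_deviation(values) ** 2)  # variance: 4.0
```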
the square of the standard deviation
another measure of spread
we will use standard deviation more in this unit
Calculating Standard Deviation Using Technology
independent of sexual orientation
How should we ask about gender in our studies?
A. What is the normal distribution?
found in a lot of seemingly random real world observations
we represent patterns in random events using probability distributions: the normal distribution is one example
used as an approximation in statistical data
this isn't always a good assumption
a probability distribution (connect the tops of bars in a histogram) that forms a bell shape
useful because it is relatively easy to model
notation used in textbook: X~N(x̄, σ²)
example: X~N(5, 2.3²)
One person is a flipper, one person is a counter
1) Flip a coin 15 times
2) Record how many times heads came up
3) Repeat steps 1 and 2 three more times
4) Tell the teacher how many times heads came up in each set of 15 flips
symmetrical (mean, median and mode all the same)
bell shaped, approaching 0 at the extremes
68% of data within 1 SD
95% of data within 2 SD
99.7% of data within 3 SD
y-axis usually represents the probability
total area under the curve = 1
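The 68%, 95% and 99.7% figures can be checked with `statistics.NormalDist` from the Python standard library. The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Area under the curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The same shares hold for any mean and standard deviation, because the distances are measured in standard deviations.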
4.1 Probability Distributions
Random variable = X
variable subject to chance
Discrete Random variable = X
variable that assumes a unique value for each outcome
Probability Distribution of a Discrete Random Variable
Probability Distribution of a Continuous Random Variable
Calculate the probability of each outcome for the sum when rolling a pair of dice.
Is the sum a discrete or continuous variable?
What is the expected value?
Calculating Expected Value
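For the sum of a pair of dice, the probability distribution and the expected value can be found by listing all 36 equally likely outcomes. A short sketch (variable names chosen for this example):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 outcomes give each sum
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
distribution = {total: Fraction(c, 36) for total, c in sorted(counts.items())}

for total, probability in distribution.items():
    print(total, probability)

# Expected value: add up (value x probability) over all outcomes
expected_value = sum(total * p for total, p in distribution.items())
print(expected_value)  # 7
```

The sum can only take the values 2 to 12, so it is a discrete random variable.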
the digits of pi are considered pseudorandom.
although they follow the pattern of a uniform distribution, the order in which the digits appear doesn't change
Continuous Uniform Distribution
all outcomes have the same probability
seen in randomly distributed natural phenomena
Discrete Uniform Distribution
all outcomes have the same probability
dice roll, coin flip, digits of pi
probability of getting a certain number of successes in a fixed number of Bernoulli trials
approximates normal distribution
how many heads in 15 coin flips?
number of trials needed to get a success in repeated Bernoulli trials
approximates exponential distribution
how many times do you have to roll a die before you get a 6?
number of successes in a fixed number of trials with no replacement
experiment where there are 2 outcomes: success and failure
probability of success = p
probability of failure = 1 - p
trials are independent
Back to the Quincunx
1) Name 5 examples of Bernoulli Trials
2) What is p for a coin flip?
3) What is 1-p for a coin flip?
4) What is the probability of getting all heads with 5 flips?
5) What is the probability of getting 3 heads with 5 flips?
6) What is the probability of getting 7 heads in 15 flips?
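Questions 4 to 6 are binomial probabilities: the number of ways to choose which flips are heads, times the probability of any one such sequence. A sketch with exact fractions (the function name is made up):

```python
from fractions import Fraction
from math import comb

def binomial_probability(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

half = Fraction(1, 2)  # p for a fair coin
print(binomial_probability(5, 5, half))   # 1/32
print(binomial_probability(3, 5, half))   # 5/16
print(binomial_probability(7, 15, half))  # 6435/32768
```

As a decimal, 6435/32768 is about 0.196.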
Some important distributions...
E(X) = np
P(X = x) = (1-p)^x p
Probability of k successes in r trials with removal
Probability of x failures before a success
Probability of k successes in n trials
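The geometric and hypergeometric probabilities described above can be sketched the same way. The geometric version uses P(X = x) = (1-p)^x p for x failures before a success. The hypergeometric version uses the standard counting formula; its parameter names (`successes`, `population`) and the urn example are assumptions made for this sketch.

```python
from fractions import Fraction
from math import comb

def geometric_probability(x, p):
    """P(X = x): x failures before the first success."""
    return (1 - p) ** x * p

def hypergeometric_probability(k, r, successes, population):
    """P(X = k): k successes in r draws without replacement."""
    ways = comb(successes, k) * comb(population - successes, r - k)
    return Fraction(ways, comb(population, r))

# Two rolls that are not a 6, then a 6 on the third roll
print(geometric_probability(2, Fraction(1, 6)))  # 25/216

# Hypothetical urn: 3 red and 2 blue marbles, draw 2, both red
print(hypergeometric_probability(2, 2, 3, 5))    # 3/10
```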