**Methods in Urban Planning**

Brian J. McCabe

September 16, 2013

Course: Statistical Literacy for Planning Professionals

1. Neighborhood Walk

2. Basic Statistical Tools

3. Key Neighborhood Indicators

4. Data Collection & Sources

5. Demographic Trends in Washington, DC

Unit of Analysis

: Unit of observation that we're analyzing (e.g., individual, neighborhood, SMD, etc.)

Variable

: Any characteristic that changes - or varies - from one observation to another.

Four Levels of Measurement

: Nominal, Ordinal, Interval, Ratio

Reliability

: Refers to the consistency of a measure, whether it produces the same result across time.

Validity

: Refers to whether the measurement you use actually gets at the concept you're trying to measure

Measurement Error

: Recognizes the imperfections of measurement, that measurement of social phenomena is rarely perfect.

Count

: Simple counts of the number of times something occurs

Proportion

: The number of items in a group divided by the total number of items.

Percentages

: The proportion multiplied by 100.

Rate

: The frequency of an outcome, relative to a base number

Ratio

: Comparison of one sub-group to another.

Percentiles

: The value of a variable below which a certain percentage of observations fall (e.g., a score in the 25th percentile means that 25 percent of scores fall at or below that score).

Frequency Distribution

: A method for understanding all of the observations that share a particular property; it displays the number of times (or frequency) that a particular property occurs.
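A frequency distribution can be tallied in a few lines of code. The sketch below is illustrative only: the `tenure` list is hypothetical data, not drawn from any DC source.

```python
from collections import Counter

# Hypothetical tenure codes for ten housing units (illustrative data only).
tenure = ["owner", "renter", "renter", "owner", "renter",
          "renter", "vacant", "owner", "renter", "renter"]

# Count how many times each category occurs.
frequency = Counter(tenure)

# Print each category with its count and its percentage of all units.
for category, count in frequency.most_common():
    percentage = 100 * count / len(tenure)
    print(f"{category:<8}{count:>3}{percentage:>7.1f}%")
```

The same tally gives counts and percentages at once, which is why frequency tables are often the first step before charting.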

Measures of Central Tendency:

Mean

: Equal to what we colloquially think of as the average, the mean is equal to the sum of scores divided by the total number.

Median

: The middle score in an ordered distribution.

Mode

: The score that occurs most frequently.
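As a minimal sketch of the three measures, the example below uses Python's standard `statistics` module. The `incomes` list is hypothetical and chosen only so the three measures give different answers.

```python
import statistics

# Hypothetical household incomes (in dollars) for seven households.
incomes = [28000, 35000, 35000, 42000, 51000, 64000, 95000]

mean_income = statistics.mean(incomes)      # sum of scores / number of scores
median_income = statistics.median(incomes)  # middle score in the ordered list
mode_income = statistics.mode(incomes)      # score that occurs most often

print(mean_income, median_income, mode_income)  # 50000 42000 35000
```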

AMI: Area Median Income

(e.g., 30% AMI, 50% AMI)

Rate of 311 calls per neighborhood

e.g., Last year in Shaw, the rate of 311 calls was 75 calls per 100 residents.

Percentage of students passing standardized exams

Percentage of residential units that are owner-occupied (percentage of homeowners)

FAR: Floor-Area Ratio

Floor area = 1,000 square feet

Plot area = 500 square feet

FAR = 1,000/500 = 2.0
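The FAR arithmetic above can be written as a one-line function. This is a sketch; the function name `floor_area_ratio` is an illustrative choice, and the inputs are the figures from the example.

```python
def floor_area_ratio(floor_area_sqft, plot_area_sqft):
    """Total floor area of the building divided by the area of its plot."""
    return floor_area_sqft / plot_area_sqft


# Figures from the example: 1,000 sq ft of floor area on a 500 sq ft plot.
far = floor_area_ratio(1_000, 500)
print(far)  # 2.0
```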

Measures of Variability:

Range

: The distance between the minimum score and the maximum score.

Standard deviation

: A statistic that tells us how far all of the scores are spread around the mean of a distribution.
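A short sketch of both measures, using hypothetical scores. One assumption to flag: the slides do not say whether the population or the sample standard deviation is meant, so the example computes the population version (`statistics.pstdev`) and notes the sample version alongside it.

```python
import statistics

# Hypothetical scores (illustrative only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: distance between the minimum and the maximum score.
score_range = max(scores) - min(scores)

# Population standard deviation: typical distance of scores from the mean.
population_sd = statistics.pstdev(scores)

# Sample standard deviation divides by n - 1 instead of n, so it is slightly larger.
sample_sd = statistics.stdev(scores)

print(score_range, population_sd, round(sample_sd, 3))
```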

Collecting Data:

- Census Data: Collected every ten years (most recently in 2010); counts the entire population.

- American Community Survey:

1-year estimates (geographies > 65,000)

3-year estimates (averaged, geographies > 20,000)

5-year estimates (census tracts, zip codes, etc.)

ACS vs. Decennial Census

- Sampling error (and margin of error)

- Concerns about measuring social change on small scales

- five-year estimates, rather than one-year counts
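Because ACS figures are sample-based estimates, each one is published with a margin of error (at the 90 percent confidence level). The sketch below shows one rough way to read two estimates together; the income figures are hypothetical, and checking whether ranges overlap is a quick screen rather than a formal significance test.

```python
def confidence_interval(estimate, margin_of_error):
    """Return the (low, high) range implied by an estimate and its margin of error."""
    return estimate - margin_of_error, estimate + margin_of_error


def intervals_overlap(first, second):
    """True when two (low, high) ranges share at least one value."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical median household income estimates for one census tract.
earlier = confidence_interval(52_000, 6_000)   # (46000, 58000)
later = confidence_interval(57_000, 7_000)     # (50000, 64000)

# Overlapping ranges suggest the apparent $5,000 increase could be sampling error.
print(intervals_overlap(earlier, later))  # True
```

This is the practical concern with small geographies: the smaller the sample, the wider the margin of error, and the harder it is to tell real change from noise.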

Collecting data

- Administrative data (e.g., Department of Human Services, Metropolitan Police Department, Office of Tax and Revenue, etc.)

- Publicly-available data (at data.dc.gov)

- NNIP: National Neighborhood Indicators Partnership (Neighborhood Info DC website)

- Census (Decennial), American Community Survey (rolling estimates, e.g., 2005-2009)

- ANC/SMD

- Zip Code (n=28)

- Neighborhood Cluster (n=39)

- Police Service Areas (n=56)

- Ward (n=8)

- Census Tract (n=~180)

Advantages/Disadvantages of collecting or analyzing data at each of these units of analysis?

Why would we choose to collect - and report - data at each of these geographic areas? What data would be relevant or useful at each geography?

At the neighborhood cluster level, what are some variables that planners might be interested in measuring?

- Population characteristics

- Density of businesses

- Housing indicators

- Crime statistics

- Percentage of land area zoned residential

- Number of affordable housing units

- Imprecise tools

- Poorly-worded survey questions

- Interview biases

- Respondent biases (e.g., social desirability)

- Coding errors

When would we use each of these measures? How do they differ from one another?

Why median household income or median housing value, rather than the mean income or value?
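One way to see why the median is usually preferred for incomes and housing values: a single extreme observation pulls the mean sharply but barely moves the median. The home values below are hypothetical.

```python
import statistics

# Hypothetical home values on one block (illustrative, not real sales data).
values = [300_000, 320_000, 350_000, 380_000, 400_000]

# The same block after one very expensive home is added.
with_outlier = values + [5_000_000]

print(statistics.mean(values), statistics.median(values))
# 350000 350000
print(statistics.mean(with_outlier), statistics.median(with_outlier))
# 1125000 365000.0
```

The mean more than triples, while the median rises by only $15,000, so the median better describes the typical home.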

Neighborhood:

Mt. Vernon/Shaw/Convention Center

Neighborhood Cluster 7: Shaw/Logan Circle

Neighborhood Cluster 8: Includes Chinatown, Mt. Vernon Square

Carnegie Library at Mt. Vernon Square

Convention Center

City Market at O St.

Jefferson Market Apartments

Parcel 42

Commercial Corridor - 9th Street

Bread for the City

Discrete

Continuous

Nominal

Ordinal

Ratio

Interval

Special Category: Dichotomous

Measuring Gentrification:

What variables could we use to measure whether or not a neighborhood is gentrifying, and how much gentrification has occurred?

Number of affordable housing units in Shaw

Number of 311 calls last year in Chinatown

Proportion of affordable housing units that are in Shaw (the number of units in Shaw divided by the total number of units in the city)

Proportion of 311 calls that were made from Chinatown (the number of calls in Chinatown divided by the total number in the city)

Why do we care about the rates, rather than just counts?
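A sketch of the difference, using two hypothetical neighborhoods (the names, call counts, and populations are invented for illustration). The helper `rate_per` is likewise an illustrative name.

```python
def rate_per(count, base_population, per=100):
    """Number of occurrences per `per` members of the base population."""
    return count * per / base_population


# Hypothetical 311 call counts and resident populations.
neighborhoods = {
    "Neighborhood A": {"calls": 6_000, "residents": 8_000},
    "Neighborhood B": {"calls": 9_000, "residents": 30_000},
}

rates = {
    name: rate_per(data["calls"], data["residents"])
    for name, data in neighborhoods.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} calls per 100 residents")
```

Neighborhood B has more calls in total, but Neighborhood A has the higher rate (75 versus 30 per 100 residents). Rates adjust for the size of the base population, so places of different sizes can be compared fairly.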

Ratio of renters to homeowners

Owner-occupied housing units: 100,000

Renter-occupied housing units: 150,000

Ratio of renters to homeowners = 150,000: 100,000 = 1.5:1
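The same two counts can be expressed as a ratio or as a percentage. The sketch below uses the figures from the example; only the variable names are added.

```python
owner_occupied = 100_000
renter_occupied = 150_000

# Ratio: one sub-group compared to another sub-group.
renters_per_owner = renter_occupied / owner_occupied

# Percentage: one sub-group compared to the total of all groups.
total_units = owner_occupied + renter_occupied
owner_percentage = 100 * owner_occupied / total_units

print(f"{renters_per_owner}:1 renters to homeowners")   # 1.5:1
print(f"{owner_percentage}% of occupied units are owner-occupied")  # 40.0%
```

A ratio of 1.5:1 and a homeownership percentage of 40 describe the same data; the difference is whether the denominator is the other sub-group or the whole.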

Average (Mean) Salary for Teachers in DC Public Schools: $77,512

Visual Displays of Quantitative Information

- Bar Graphs

- Histograms

- Line Graphs

Bar Graphs: A bar graph is a visual display of discrete categories (either nominal or ordinal) where the length of each bar represents the percentage or frequency of a category.

Histogram: A histogram is a visual display for continuous data (interval/ratio) where the scores are presented along one axis and the frequency (or percentage) of that score is presented along the other axis. Often, continuous data are recoded into categories before the construction of a histogram (e.g., a continuous GPA may be recoded into intervals of 0.10).

Line Graph: A line graph is a visual display of data typically used to track a social phenomenon across time, or some other continuous measure.

Pie Chart: Pie charts aren't particularly good for displaying statistical information. First, and most importantly, pie charts can tell us about the relative size of categories but, unlike bar charts or histograms, tell us nothing about their frequency. Second, it is often difficult to correctly visualize the relative size of a piece of the pie.

Basic Rules for Good Data Visualization:

1. Data visualizations are used to tell a story. When you create a graph or chart, make sure that it tells a story. Viewers should be able to "read" the story with only the chart (and no accompanying text).

2. Make sure to select an appropriate type of graph. Line graphs track trends across time, bar charts display data across discrete categories, etc.

3. Pay attention to details. Clearly label your axes. Ensure consistent scales on the axes. Include a legend (where appropriate). Write titles that identify the information in the chart.

4. Avoid perceptual distortions. The relationship between visual components should provide a quick understanding of the story.

5. Minimize data "junk". This includes excess colors, symbols, and information not directly related to the data story itself.

Course Outline

1. Final Projects Discussion + Groups

2. Discussion: American Murder Mystery Revisited

3. Review: Descriptive Statistics

4. Analysis: Descriptive Statistics in Minitab

5. Visual Displays of Quantitative Information

6. Analysis: Charts & Graphs using Minitab & Excel

7. Advanced Analytical Techniques

American Murder Mystery Revisited

Describe the research question. What is the debate the authors are entering into?

What are the theories linking housing vouchers to crime? What are the possible mechanisms that would explain this relationship?

How did the authors test the relationship? What kind of data did they use? What statistical techniques did they use?

Group 1: McMillan/Pleasant Plains/Bloomingdale/Eckington/Stronghold

(Armed Forces Retirement Home / Michigan Ave./Irving to the North, 5th/Park Place to the west, Florida Ave. to the south, 2nd Street NE / Glenwood Cemetery to the east)

Group 2: Near Southwest (approximate boundaries – SE/SW Freeway on the north, Washington Channel and Anacostia to the south, 14th St. / and Bridge to the west and South Capitol Street to the east)

Group 3: Georgia Ave. Gateway/Takoma DC (DC Boundary (Eastern Ave.) on the north and east, Piney Branch / Tuckerman on the south, Rock Creek Park on the west)

Many planning documents begin with an overview of the demographic characteristics of the neighborhood or community. They discuss population shifts and outline the demographic composition of the neighborhood. Often, they discuss the market conditions of particular places, including rental prices or home sale values. These portraits of local neighborhoods help citizens and planning professionals understand the neighborhoods in which they are working. You are required to write a 3-4 page (double-spaced) demographic analysis. Focus on using the quantitative data available to tell a story about the neighborhood for outsiders unfamiliar with it.

Use existing planning documents as your guide in this process. Before embarking on your own project, look at some of the planning documents or historic preservation reports released by the Office of City Planning. (Already, we have looked at those reports for Mt. Vernon Square and the area surrounding the convention center.) Look at the types of information they report, and the way they organize quantitative information.

The written portion of the assignment should include an interpretation of quantitative data that you have compiled from existing sources. You are expected to create 2-3 figures presenting a visual display of your data. These can include bar charts, line graphs, or other visual displays common in the planning literature. The demographic analysis should tell a convincing story about the neighborhood you are studying. It is not enough to simply list statistics for the readers to interpret; instead, you should use these statistics to tell a story about the neighborhood.

The quantitative data analysis assignment is due on Monday, November 18th.

Example: The number of subsidized units (count) by the type of subsidy (discrete) in Washington, DC.

How many of each type of subsidized housing unit (e.g., public housing, HCV, and LIHTC) exist in Washington, DC?

Example: The median household income (continuous) in the neighborhoods bordering the School of Continuing Studies.

How does the median income of the neighborhoods near SCS differ?

Example: Homeownership rate in neighborhoods across Washington, DC.

Instead of creating a bar chart with 39 bars (one for each neighborhood cluster), we might create a histogram to show us the frequency with which each homeownership rate (continuous) occurred.

Example: Trends in the rate of violent crime over time.

How has the crime rate changed in Shaw/Logan Circle over the last ten years?

Correlation:

Best Fit Line (Beginnings of Linear Regression):

We often talk about social phenomena that are correlated. When we discuss correlation, we're considering two continuous measures that co-vary - or that vary together.

When the value of one variable systematically changes as the value of the second variable changes, we say that the two variables are correlated.

A scatter plot is a two-dimensional graph that shows the coordinates between two variables - X and Y - for all the observations in a data set. It provides visual evidence to assess whether two variables are correlated.

As reading scores increase, writing scores increase, as well. We would say that reading scores and writing scores are positively correlated.

Each dot on the scatter plot is a different observation in our data (in this case, each dot is a different student in our data)

**Two continuous variables - X and Y - can be said to be related in one of two ways:**

1. Positive Correlation.

- When the value of X increases, the value of Y increases.

2. Negative Correlation.

- When the value of X increases, the value of Y decreases.

**In addition to noting the direction of a correlation, we can talk about how strong the correlation is.**

For example, shoe size and height are very strongly correlated. We can make a pretty good guess about what your shoe size is when we know your height.

Other variables have an association, but the correlation is much weaker. For example, we might know that hours slept is weakly correlated with exam scores. There is a relationship between them, but it is not particularly powerful.

As a rule of thumb, we generally think of a correlation less than 0.2 as weak, 0.2 to 0.5 as moderate, and above 0.5 as strong.
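The rule of thumb can be written as a small lookup. This is a sketch: the function name `correlation_strength` is illustrative, and the cut-offs are only the rough thresholds stated above, applied to the absolute value so that negative correlations are labeled the same way.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the rule-of-thumb cut-offs."""
    size = abs(r)  # strength ignores the sign (direction) of the correlation
    if size < 0.2:
        return "weak"
    if size <= 0.5:
        return "moderate"
    return "strong"


print(correlation_strength(0.1))    # weak
print(correlation_strength(0.35))   # moderate
print(correlation_strength(-0.65))  # strong
```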

**Scatter Plots & Correlations (in Minitab or Excel):**

1) Median family income & housing values?

2) Property crimes & housing sales

3) Poverty rate & % foreign-born

4) Unemployment & % black

The "best fit" line.

There are an infinite number of lines that I could draw through the data. How do I know which one is the "best fit" line?

The "best fit" line is the line that minimizes the amount of error between each observation and the regression line.

For the moment, suffice it to say that the "best fit" line is the line that best reduces the amount of error between each observation and the line.

For each observation, the difference between the observed value and the predicted value is the error term.
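A minimal sketch of how a best fit line and its error terms can be computed. It assumes the usual least-squares definition (the line that minimizes the sum of squared errors); the `reading` and `writing` scores are hypothetical, and the helper names are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Add up the squared gap between each observed and predicted value."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical reading (X) and writing (Y) scores for five students.
reading = [1, 2, 3, 4, 5]
writing = [2, 4, 5, 4, 5]

slope, intercept = fit_line(reading, writing)
best_error = sum_squared_errors(reading, writing, slope, intercept)

# Any other line, such as Y = 1 + 1.0 * X, leaves more total squared error.
other_error = sum_squared_errors(reading, writing, 1.0, 1.0)

print(round(slope, 2), round(intercept, 2))          # 0.6 2.2
print(round(best_error, 2), round(other_error, 2))   # 2.4 4.0
```

Squaring the errors keeps positive and negative errors from cancelling out, which is one reason least squares is the standard choice.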

**Note: Pearson's r always ranges from -1 to 1.**

The sign indicates whether the variables are positively or negatively correlated.

The value (absolute value) indicates the strength of the correlation.

-1 indicates a perfect negative correlation

1 indicates a perfect positive correlation

0 indicates that variables are uncorrelated
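For readers who want to see where the number comes from, here is a sketch of Pearson's r computed from its standard formula (the covariation of X and Y divided by the product of their spreads). The data are hypothetical and the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for five observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
perfect_positive = pearson_r([1, 2, 3], [2, 4, 6])
perfect_negative = pearson_r([1, 2, 3], [6, 4, 2])

print(round(r, 3), perfect_positive, perfect_negative)
```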

**Concept: Pearson Correlation Coefficient**

**Logistic Regression: We use a logistic regression when the outcome variable is dichotomous (yes/no), rather than continuous. With logistic regression, we talk about odds or odds ratios.**
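To make "odds" and "odds ratios" concrete, the sketch below converts probabilities into odds and compares two groups. It illustrates the vocabulary only and does not fit a logistic regression; the probabilities are hypothetical.

```python
def odds(probability):
    """Odds = probability of the outcome divided by probability of no outcome."""
    return probability / (1 - probability)


def odds_ratio(probability_group_1, probability_group_0):
    """How many times larger the odds are in group 1 than in group 0."""
    return odds(probability_group_1) / odds(probability_group_0)


# Hypothetical probabilities of owning a home (yes/no outcome) in two groups.
p_group_1 = 0.6
p_group_0 = 0.4

print(round(odds(p_group_1), 2))                   # 1.5
print(round(odds(p_group_0), 2))                   # 0.67
print(round(odds_ratio(p_group_1, p_group_0), 2))  # 2.25
```

An odds ratio above 1 means the outcome is more likely in the first group; an odds ratio of 1 means the odds are the same in both.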

Example: Distribution of the number of housing choice vouchers across census tracts in DC.

Data Analysis using Minitab

- Simple "point & click"

- Doesn't require programming knowledge

- Useful for simple descriptive statistics

- Challenges of large datasets

- Good for analysis; not good for visual displays

- Free and available from UIS @ Georgetown

Three Datasets

- Federally-subsidized housing units in Washington, DC (Source: HUD Picture of Subsidized Households)

- Federally-subsidized housing units in Washington, DC, by Census Tract (Source: HUD Picture of Subsidized Households)

- Aggregated Census/ACS data in Washington, DC, by neighborhood cluster (Source: Neighborhood Info DC)

Basic Analysis:

- Find mean, median

- Display counts

- Describe the distribution, range

Question: What is the median number of public housing units in a census tract in Washington, DC?

Question: What is the mean/average number of public housing units in a census tract in Washington, DC?

Question: How many census tracts in Washington, DC contain zero public housing units?

Question: What is the average household income in the census tract with the highest number of public housing units?

Question: Recode the data on public housing units to create a frequency table showing the number of census tracts containing zero public housing units, 1-50 public housing units, 51-100 public housing units, and 101 or more public housing units.
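The recode-then-tally logic in this question can be sketched outside Minitab as well. The example below is a Python illustration of the logic, not the Minitab procedure; the tract values in `public_housing_units` are hypothetical, not taken from the HUD dataset.

```python
from collections import Counter


def recode_units(units):
    """Assign a count of public housing units to one of four categories."""
    if units == 0:
        return "0"
    if units <= 50:
        return "1-50"
    if units <= 100:
        return "51-100"
    return "101 or more"


# Hypothetical public housing unit counts for ten census tracts.
public_housing_units = [0, 0, 12, 50, 51, 75, 100, 101, 240, 0]

# Recode every tract, then count how many tracts fall in each category.
frequency_table = Counter(recode_units(u) for u in public_housing_units)

for category in ["0", "1-50", "51-100", "101 or more"]:
    print(f"{category:<12}{frequency_table[category]:>3}")
```

Note that the boundary values (50, 51, 100, 101) are the ones most likely to be miscoded, so it is worth checking them by hand.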

Question: Recode data on average household income to identify the number of neighborhoods where the median income of public housing residents is greater than $12,000.

To calculate basic descriptive statistics: STAT - Basic Statistics - Display Descriptive Statistics

To calculate basic descriptive statistics: CALC - Column Statistics

To create a frequency distribution: STAT - Tables - Tally Individual Variables

To create a cross-tab: STAT - Tables - Cross-Tabulation and Chi-Square

To calculate correlation: STAT - Basic Statistics - Correlation

Making Charts & Graphs:

- Excel for Bar Charts & Line Graphs

- Minitab for Histograms

Codebook: Each dataset comes with a codebook that identifies how the variables are coded in the data. Often, it includes numerical identifiers of missing data that must be recoded before the analysis.

**OLS (Ordinary Least Squares) Regression: We use regression analysis to understand the relationships between variables. OLS regression analysis is used when the dependent variable is continuous. We talk about how a unit-change in an independent variable is associated with an outcome variable.**

**1. Quantitative Data Projects**

2. Data, Big Data + Planning Research

3. Data Availability

4. The Promise of Big Data?

5. Data + Planning Research

6. Speaker: Kevin Donahue, CAPSTAT

Quantitative Data Projects

Group 1: McMillan, Pleasant Plains, Bloomingdale, etc.

Group 2: Near Southwest

Group 3: Georgia Avenue/Takoma

- What type of data did you use?

- What type of analysis did you do?

- What did you find?

- Challenges/surprises?

Data Availability: Washington, DC

- What types of data are available to planners and urban policymakers in Washington, DC?

- What types of research and planning applications are useful for those data?

**Demographic Data**

**Housing and Foreclosure Data**

**Property Sales & Assessment Data**

**Property Characteristics**

**Crime Statistics**

**- Building Permits**

- Liquor License

- Public Space Permits

- Educational Data (Schools, Students)

**What is Big Data?**

What are the features of Big Data?

**Volume**

**Complexity**

**Layers**

Students

Schools

School Characteristics

Teacher Characteristics

Peer Characteristics

Family

TANF

Parents' Employment Status

Food Sources

Mobility

Neighborhoods

Violent Crime

Demographics

Institutions

Bus Routes

Assessment

Examples: 311 Calls in DC

Property as a Complex Structure

- Structural features

- Transaction data

- Neighborhood characteristics

- Transportation accessibility

- Proximity to schools, stores, crime, etc.

- Time variation

**What promise does Big Data hold for cities and urban environments?**

**Challenges of Big Data?**

**- Privacy**

- Complexity/Skills

- Too much data, "noise"

- How do children get to/from school, and what are the reasons behind those choices?

- What is the relationship between home/school distance and the mode of travel?

- What neighborhood characteristics influence those choices?

- Survey research in four middle schools in Bend, Oregon and Springfield, Oregon

- Demographic indicators, geo-coded address data

- Primary mode of travel; whether children ever "actively" traveled to school (e.g., bike, walk)

- Questions about distance to school; measures about urban form (e.g., intersection density, route directness, major roads, railroads)

- What is the relationship between urban form (e.g., metropolitan region, sprawl) and exposure to poor air quality?

- Decennial census data on neighborhoods

- Environmental Protection Agency (EPA) data on neighborhood-level air quality

- Sprawl index, includes several measures of sprawl

- Knowing something about modes of transportation - and the impediments to active modes of transportation - can help planners in thinking about the relationship of schools and homes, or about transportation availability in neighborhoods.

- Addresses questions about whether infill development could improve health outcomes by putting people in dense areas, or whether it increases exposure to poor air quality.

- Do parking requirements (making parking spaces available with residential units) make housing more expensive?

- How do parking requirements shift the development decisions that developers make?

- Surveys with developers

- Administrative data on housing construction, parking spaces

- Contributes to debates about whether new developments should have parking minimums, especially when these developments are located in cities with a strong public transportation infrastructure and marketed toward demographic groups less likely to own cars.

- Does subsidized housing improve neighborhoods by leading to investments in communities or local improvements?

- How (or why) would we expect housing investments to change neighborhood characteristics - and especially, school outcomes?

- Administrative data on the location and type of subsidized housing assistance

- School-level data on student achievement, teacher characteristics

- Addresses issues of externalities associated with the placement of subsidized housing.

- Raises the possibility of positive benefits (to local schools) while most studies are concerned about the negative externalities (e.g., crime)