**A Guide to Bad Data Journalism**

Anatomy of a FiveThirtyEight article

A Big Question. Clickbait. Often causal:

Does X cause Y?

An almost ironic lack of awareness of the inevitable 'correlation ≠ causation' objection that will follow in the caveats (better articles) or comments (worse articles).

Something that a 400 word article cannot possibly answer, especially if it attempts to use original analysis.

THE TITLE

A few relevant facts and numbers, to show that this is Data Journalism we're engaged with, not some kind of shoddy op-ed.

THE CONTEXT

This topic is 'of great interest' to some academics. Some have probably spent entire careers examining our question. But they're hedgehogs and we're foxes, so on to...

(A NOD TO) THE LITERATURE

In Data Journalism there's always data. We may even go a step further than mainstream journalists and link to the data. But not too specifically (e.g. does 2010 mean GCI year 2009-10 or 2010-11?). Tests on our website showed that drives away readers.

We could put the data and analysis code up online, but Github doesn't support ads so there's not much incentive.

THE DATA

At the time of writing FiveThirtyEight has nine datasets on Github, after nearly two months and - probably - hundreds of articles.

Often a bivariate scatter, ignoring all the other confounding variables that might mediate this relationship. And hence often entirely meaningless.

But this is Data Journalism, so despite that limitation we're obliged to include a bold red regression line. Don't worry: there'll be a couple of caveats later in the text.

THE CHART

Bonus points if we can claim something is 'significant', because this reinforces that we're doing Data Journalism. But we won't include p-values or even state our standard of significance, or show diagnostics like r^2, because tests on our website showed that also drives away readers.

THE (UNFOUNDED) CONCLUSION

Just enough caveats to remind you this is Data Journalism. They'll get lost in the Facebook posts and Twitter retweets, of course, but that's not our responsibility.

It's likely even the caveats we mention are severe enough to undermine the whole article, but we have to submit two articles by lunchtime so it's time to click Publish.

THE CAVEATS

Unfortunately, articles like this do more to obscure and confuse than they do to illuminate and inform. Let's look at some of the reasons this analysis is problematic...

Four problems with FiveThirtyEight's Unions article

1. 'Significant' is not the whole story

2. Simplistic analysis is misleading

3. A few countries are very influential

4. What are we measuring, anyway?

'The relationship is significant', the article says, as if this is the end of the story. But statistical significance is an arbitrary standard.

The 'p-value', a measure of significance, is 0.04 for this regression. The usual standards of significance in the social sciences are < 0.01, < 0.05 and < 0.1. In other words, this relationship is significant at the 0.05 level, but not at the 0.01 level. It would help a lot if articles like this reported the p-value, so readers could make up their own minds.
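To make 'report the p-value' concrete, here is a minimal sketch of one way to attach a p-value to a regression slope. It uses a permutation test, which is a swapped-in technique chosen because it needs nothing beyond the Python standard library; it is not the t-test that produced the 0.04 quoted above. The function names are hypothetical, and no real data are included.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the slope (hypothetical helper).

    Shuffling ys breaks any link with xs, so the shuffled slopes show
    what 'no relationship' looks like for this particular sample.
    """
    rng = random.Random(seed)
    observed = abs(ols_slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(ols_slope(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible 0.
    return (hits + 1) / (n_perm + 1)
```

Called as `permutation_p_value(unionization, competitiveness)` on two equal-length lists, it returns a single number that can be printed next to the chart. That is all 'report the p-value' asks for.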

Of course many readers will ignore the text, and see only the upwards-sloping regression line, the bold red plotting of which obscures any estimation uncertainty.

It is possible to visualise the uncertainty of a regression line (see left, a 90% Working-Hotelling confidence band), but few articles ever do this.
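For reference, a Working-Hotelling band has a closed form. What follows is the standard textbook statement for simple linear regression, written in notation that is ours rather than the article's: $n$ observations, fitted coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, residual mean square $\mathrm{MSE}$, and an $F$ quantile with $2$ and $n-2$ degrees of freedom.

```latex
\hat{y}(x) \;\pm\; W \sqrt{\mathrm{MSE}\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)},
\qquad
\hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x,
\qquad
W^2 = 2\,F_{1-\alpha;\,2,\,n-2}
```

For a 90% band, $\alpha = 0.10$. The band is narrowest at $\bar{x}$ and flares out towards the edges of the data, which is exactly the information a single bold line hides.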

When a relationship is complex, and mediated by many unmodelled factors, a single model may not fit all the observations well.

To the left the countries circled in blue are the Nordics: Sweden, Norway, Finland and Denmark. These countries have exceptionally high levels of unionization coupled with high competitiveness.

But they're also distinct in many other ways, being famously progressive, egalitarian, socially cohesive societies. Perhaps one of these other differences is important here. If the model is robust, taking them out shouldn't affect the relationship much.

And yet with the Nordics excluded, the regression line is flatter and no longer even close to significant (p-value = 0.73). By itself, this need not mean the model is wrong, but it's certainly worthy of further investigation.
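The check described above, refitting without a handful of suspect observations, takes only a few lines. This is a minimal sketch: the helper names are hypothetical and the numbers are made up to show the mechanics, not taken from the article's data.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slope_with_and_without(data, excluded):
    """Fit once on everything, once without the excluded names.

    data maps a label to an (x, y) pair; excluded is a set of labels.
    Returns (slope_all, slope_kept).
    """
    everyone = list(data.values())
    kept = [xy for name, xy in data.items() if name not in excluded]
    slope_all = ols_slope([x for x, _ in everyone], [y for _, y in everyone])
    slope_kept = ols_slope([x for x, _ in kept], [y for _, y in kept])
    return slope_all, slope_kept


# Made-up numbers, NOT the article's data: three similar units and one outlier.
toy = {"A": (1, 1), "B": (2, 1), "C": (3, 1), "D": (10, 5)}
slope_all, slope_kept = slope_with_and_without(toy, excluded={"D"})
print(f"all: {slope_all:.2f}, without D: {slope_kept:.2f}")
```

In the toy data the whole upward slope comes from one point. A large gap between the two slopes is a prompt to look harder, not proof that the model is wrong.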

The human world is complex. No interesting causal relationship involves only two variables. When you perform a 'bivariate' analysis like this, you're forcing the one explanatory variable (unionization) to explain all the variation in the dependent variable (competitiveness).

But because the world is not a randomised experiment, unionization will be correlated with lots of other variables that might equally well predict competitiveness.

For example, Wikipedia lists a measure of 'irreligion' (atheism, agnosticism, etc) for countries, based on Gallup polling data. It turns out this is a stronger predictor of competitiveness (r^2 = 0.47) and more significant (p-value = 0.00004). Moreover, in a model with both irreligion and unionization, unionization is no longer significant (p = 0.69), while irreligion still is (p = 0.0004).

You can play this game ad infinitum. Without more thoughtful analysis and a good theory, none of these regressions tells us anything.
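The mechanics of that game are easy to reproduce. The sketch below uses deliberately artificial numbers (an assumption for illustration, not the unionization or irreligion data) in which `y` depends only on `z`, while `x` merely travels with `z`. The function names are hypothetical.

```python
def centered_sum(a, b):
    """Sum of (a_i - mean(a)) * (b_i - mean(b))."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))


def bivariate_slope(x, y):
    """Slope from regressing y on x alone."""
    return centered_sum(x, y) / centered_sum(x, x)


def two_predictor_slopes(x1, x2, y):
    """Slopes from regressing y on x1 and x2 together (with an intercept)."""
    s11 = centered_sum(x1, x1)
    s22 = centered_sum(x2, x2)
    s12 = centered_sum(x1, x2)
    s1y = centered_sum(x1, y)
    s2y = centered_sum(x2, y)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Artificial data: y is driven entirely by z; x is only correlated with z.
z = [1, 2, 3, 4, 5, 6]
x = [1, 2, 2, 4, 5, 7]
y = [3 * zi for zi in z]

alone = bivariate_slope(x, y)
b_x, b_z = two_predictor_slopes(x, z, y)
print(f"x alone: {alone:.2f}; x with z in the model: {b_x:.2f}; z: {b_z:.2f}")
```

On its own, `x` looks like a strong predictor. Once `z` is in the model, the coefficient on `x` collapses to zero. The bivariate fit cannot tell these two worlds apart.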

'Competitiveness' must be a straightforward metric, right?

Wrong. The Global Competitiveness Index (GCI) is prepared by the WEF, to measure 'the set of institutions, policies, and factors that determine the level of productivity of a country'. 'A more competitive economy,' they claim, 'is one that is likely to grow faster over time.'

So, in fact, GCI reflects not growth itself, but inputs, weighted in a way that has traditionally explained growth, but which is not guaranteed to do so - its meaning relies on a model.

If GCI is supposed to predict growth, why not simply use out-turn growth in the analysis? The chart at left shows just that: the relationship is insignificant (but negative).

Of course, growth in 2010-2012 was largely determined by how badly countries were affected by the financial crisis, and how well they responded to it, rather than anything related to unionization, so don't read much into this.

The point is, variable choice matters: here, different measures of the same thing produce quite different results - and in such a situation we should be suspicious of all claims.

Four principles for better data journalism

1. Choose the right stories

When you take simple analysis to serious, established, complex issues (Does conflict affect a country's credit rating? Do unions affect competitiveness?) you cloud, rather than illuminate. In cases like this, a well-written review of the scholarly literature is likely to better inform public debate.

Otherwise, stick to (a) lightweight but fun topics or (b) fast moving topics yet to attract academic attention. Data journalists do both excellently.

2. Embrace complexity

The human world is complex. No interesting causal relationship involves only two variables. Any article titled 'Does X cause Y' is tabloid fare, and does more harm than good. Ban bivariate regressions, unless you're using them to contextualise a more complex relationship.

Quantitative analysis is inevitably reductive: acknowledge this.

3. Use statistics intelligently

A scatterplot of two variables with a least-squares regression line is not 'doing statistics'. It shouldn't be the endpoint of data journalism, either. If you run a regression, tell your readers the p-value(s). Examine regression diagnostics (even the simple ones, like R^2). If something looks funny, rethink your analysis.

If you don't have the time to do this, then drop the statistical analysis entirely. Bad statistics is worse than no statistics.
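Even the simplest diagnostic costs almost nothing. As a minimal sketch (the function name is hypothetical), R^2 for a one-variable least-squares fit is just the squared correlation between the two series:

```python
def r_squared(xs, ys):
    """Share of the variance in ys explained by a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either series is constant")
    # For simple linear regression, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)
```

A value near 0 means the regression line explains almost none of the scatter, however bold and red it is drawn.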

4. Finally, be modest

Caveats are useful, but if you have so many caveats as to completely undermine any conclusion, then don't offer a conclusion. Sometimes, even most of the time, the conclusion from the data is 'we don't know'. If, considering all the limitations of your analysis, you can't actually say anything, don't.

**Data and R code at http://andrewwhitby.com/bad-data-journalism**

http://fivethirtyeight.com/datalab/do-fewer-unions-make-countries-more-competitive/

Data journalism should leave readers informed, not misinformed.

Data journalism is having a moment.

It's a competitive landscape, with Nate Silver's relaunched FiveThirtyEight and the NYT's The Upshot entering the ring alongside more established players like the Guardian's Datablog.

Unfortunately, competition in online media today seems to reward volume and virality, rather than accuracy. So far, data journalism doesn't seem to be exempt from this.

Short deadlines and the pressure to produce daily content result in much low-quality work. A lot of noise, and very little signal, if you like.

This is a particular problem for data journalism, as it purports to use empiricism to reclaim the mantle of objectivity which the mainstream media are seen to have abdicated. Just as Silver's poll-driven election model stood as a beacon of reason in 2008's sea of punditry, so too would these sites jettison bias and so-called expertise, and 'let the data speak'.

The problem is that data do not speak, analysts do. And bad analysis is no better than bad expertise. Let's look at a typical example...