have found

this beautiful data...

Adjacency network of common adjectives and nouns in the novel "David Copperfield" by Charles Dickens.

And you want to find out

what is the most central word

in "David Copperfield"!

You find a class of algorithms called "centrality indices" that are supposed

to identify the most important nodes

of a network.

Algorithm Literacy

What you need to know

about algorithms

Prof. Dr. Katharina A. Zweig

You find the

betweenness centrality

Wikipedia says:

"It is equal to the number of shortest paths from all vertices to all others that pass through that node.

A node with high betweenness centrality has a large influence on the transfer of items through the network."

The most central word of "David Copperfield" is:

LITTLE

We have been tricked

by

"Algorithmic Folklore"

This is the formula for the betweenness centrality:
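
A common way to write it, with notation assumed here rather than taken from the slide: \sigma_{st} is the number of shortest paths between nodes s and t, and \sigma_{st}(v) is the number of those paths that pass through v.

```latex
C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
```

Each pair (s, t) therefore contributes a fraction between 0 and 1, not a raw count of paths.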

WARNING

Before you can download this talk from my hard disk you have to accept the general terms and conditions.

This talk can induce serious hazards to your research and your life as a scientist and a citizen.

Now, let me

get to know you better!

Have you already applied some data analysis method to data and interpreted the resulting values?

Yes

No

Have you already developed a new method of data analysis?

Yes

No

Have you already implemented a method yourself and made it public?

Yes

No

Have you already applied

a data analysis method you learned about not in a textbook but in some paper or in the documentation of a software package?

Yes

No

Now, let me introduce myself...

https://networkdata.ics.uci.edu/data.php?id=4

"A description of a method that contains wrong or at least incomplete instructions

on how to interpret

the results of

the method."

"Statistical rituals largely

eliminate statistical thinking

in the social sciences. Rituals are indispensable

for identification with social groups,

but they should be the subject rather than the procedure of science.

Gerd Gigerenzer: "Mindless Statistics", The Journal of Socio-Economics 33, 587-606, 2004

In network analysis

I see similar "rituals"

Measuring power-laws by plotting a distribution on a log-log plot and looking for a line

Equating power-law degree distributions with the preferential attachment model

Applying centrality indices without arguing the choice

....

How does it happen?

This is especially bad when software is published ready to use but the method itself is not clearly defined.

Example: Nestedness

Centrality indices

are tied to network flows, i.e., something that uses the network as an infrastructure. [Borgatti2005]

There is no network flow on the abstract network of correlated important nouns and adjectives...

This method cannot be applied to this data.

Okay, let's assume we have a nation-wide cellphone communication network that includes persons with a known terrorist background. Can the betweenness centrality identify them?

Hidden assumptions

Communication takes shortest paths

More importantly:

All persons want to talk to all other persons in the same way (no weights)

Most important (wrong) assumption:

Terrorists use this network

like other people

Borgatti, S. P.: "Centrality and Network Flow", Social Networks 27, 55-71, 2005

Air Transportation networks

Using DB1B data, we showed that 40% of all possible pairs of airports within the USA are never requested.

The remaining pairs are requested with very different frequencies.

Dorn, I.; Lindenblatt, A. & Zweig, K. A.:

"The Trilemma of Network Analysis".

SNAM 2012, Istanbul, 2012
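
The effect of the "all pairs count equally" assumption can be made visible by weighting each pair by how often it is actually requested. The following is a minimal sketch on a toy graph, not the method of the cited paper: the function names, the `demand` mapping and the node labels are all hypothetical, and the graph is assumed to be undirected and unweighted.

```python
from collections import deque
from itertools import combinations


def bfs_counts(graph, source):
    """BFS from source: distance and number of shortest paths to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def weighted_betweenness(graph, v, demand=None):
    """Sum over unordered pairs {s, t} of demand * sigma_st(v) / sigma_st.

    demand=None reproduces the classic assumption: every pair has weight 1.
    Otherwise demand maps frozenset({s, t}) to a weight; missing pairs count as 0.
    """
    dist_v, sigma_v = bfs_counts(graph, v)
    total = 0.0
    for s, t in combinations(graph, 2):
        if v in (s, t):
            continue
        dist_s, sigma_s = bfs_counts(graph, s)
        if t not in dist_s or v not in dist_s or t not in dist_v:
            continue  # s, t and v are not all connected
        if dist_s[v] + dist_v[t] != dist_s[t]:
            continue  # v lies on no shortest path between s and t
        weight = 1.0 if demand is None else demand.get(frozenset((s, t)), 0.0)
        total += weight * sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total


# Toy graph (hypothetical): A - H1 - H2, with B, C and D attached to H2.
GRAPH = {
    "A": ["H1"],
    "H1": ["A", "H2"],
    "H2": ["H1", "B", "C", "D"],
    "B": ["H2"],
    "C": ["H2"],
    "D": ["H2"],
}

# Hypothetical demand: only two pairs are ever requested.
DEMAND = {frozenset(("A", "H2")): 5, frozenset(("A", "B")): 1}

for node in ("H1", "H2"):
    print(node, weighted_betweenness(GRAPH, node), weighted_betweenness(GRAPH, node, DEMAND))
```

With equal weights H2 is the more central node (9 versus 4); with the assumed demand the ranking flips (1 versus 6). The ranking is a consequence of the flow assumption, not of the network structure alone.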

Algorithm Literacy

Don't use a method if it is not formally defined and open source

Ask your buddy data scientist questions until you are sure you understand the limits of the method and when it is really applicable

Fire your buddy data scientist if (s)he does not ask YOU enough about your data!

What is a model?

A bit of Science Theory

Weisberg’s definition is cautious:

“potential representations

of target systems.”

(2013, p. 171)

Weisberg, M.:

"Simulation and Similarity: Using Models

to Understand the World",

Oxford University Press, 2013

Weisberg (2013) argues

that models are composed of

two things:

Their structure (e.g., a concrete model, a graph, a mathematical formula, a computer simulation, ...)

A construal containing:

an assignment of real-world elements to the structure

fidelity criteria

intended scope

"To generate a target, theorists choose some phenomenon in the world that they wish to study. From the full contents of the phenomenon, they abstract, omitting all but the relevant features of this phenomenon. This process generates the target system." (Weisberg 2013, p. 172).

The Target System is

a construction by the modeler

Many data analyses contain TWO models

As seen above, many analyses are based on modeling assumptions

as well. For example:

Centrality indices assume certain flow characteristics;

Clustering algorithms assume an underlying homophily between entities

Network motif analysis requires the choice of a null-model

Terrorist identification and cell-phone networks

Why is it a model? For example because:

1) not all kinds of communication are observed;

2) some people may share a phone, others have multiple phones;

Implicit assumption: this communication network resembles the full one

Okay, algorithms in data analysis are important for me as a scientist.

Why are they changing my life as a citizen as well?

Predictive Policing

Software that predicts time and place of future crime.

PredPol says it reduced crime rates by 10 to 40% in many cities.

Future: predict crime rate of individuals

Algorithmic Folklore

reloaded

Such interdisciplinary, trans-institutional data lines make algorithmic folklore even more likely than in academic circles!

"This software predicts

crime rates"

Data Dependency

Learning algorithms depend heavily on the data they are fed.

Statistical intuition (and lack thereof) becomes very important to interpret the results.

**Science & Society**

**... needs our interdisciplinary efforts more than ever.**

**Become literate in the algorithms that we depend on!**

Strange(r) Data

Do we need algorithmic leaflets?

... to avoid serious side effects of your analysis, ask your local data scientist or your local algorithm dealer...

Quis custodiet

ipsos custodes?

(who watches the watchmen?)

Tiger Mom Tax

Princeton Review charges customers based on their ZIP code

Regions with a predominantly Asian population pay almost twice as much as other regions

New Data - New World

Algorithms not yet there

We were able to show that a platform like Facebook can deduce acquaintanceship between non-members.

Horvát, E.-Á.; Hanselmann, M.; Hamprecht, F. A. & Zweig, K. A.: "One plus one makes three (for social networks)", PLoS ONE 7, e34740, 2012

What I call the “null ritual” consists of three steps (1) set up a statistical null hypothesis, but do not specify your own hypothesis nor any alternative hypothesis, (2) use the 5% significance level for rejecting the null and accepting your hypothesis, and (3) always perform this procedure.

I report evidence of the resulting collective confusion and fears about sanctions

on the part of students and teachers,

researchers and editors, as well as

textbook writers."

Please note that it is not the sum of

all shortest paths containing v!
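
The difference between the two quantities can be checked on a small example. This is a minimal sketch, assuming an undirected, unweighted toy graph and counting each unordered pair {s, t} once; the graph and the function names are hypothetical and not taken from the talk.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations


def shortest_path_counts(graph, source):
    """BFS from source: distance and number of shortest paths to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness_and_path_count(graph, v):
    """Return (sum of fractions sigma_st(v) / sigma_st, raw number of shortest paths through v)."""
    info = {s: shortest_path_counts(graph, s) for s in graph}
    dist_v, sigma_v = info[v]
    fraction_sum = Fraction(0)
    path_count = 0
    for s, t in combinations(graph, 2):
        if v in (s, t):
            continue
        dist_s, sigma_s = info[s]
        if t not in dist_s or v not in dist_s or t not in dist_v:
            continue  # s, t and v are not all connected
        if dist_s[v] + dist_v[t] == dist_s[t]:
            through_v = sigma_s[v] * sigma_v[t]  # shortest s-t paths that pass through v
            fraction_sum += Fraction(through_v, sigma_s[t])
            path_count += through_v
    return fraction_sum, path_count


# Toy graph: a 4-cycle a-b-d-c-a with a pendant node e attached to d.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

for node in ("b", "d"):
    fraction_sum, path_count = betweenness_and_path_count(GRAPH, node)
    print(node, fraction_sum, path_count)
# b: betweenness 1, but 2 shortest paths pass through b
# d: betweenness 7/2, but 5 shortest paths pass through d
```

The two numbers differ as soon as a pair of nodes is connected by more than one shortest path, which is exactly the case the verbal description glosses over.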

This is the part we are most often aware of: complex networks are an abstract representation to understand a phenomenon of interest.

What about the

analytic method?