
Bias in System Evaluation | Julio Gonzalo (UNED, Spain)

It's the evaluation, stupid

What's wrong with this system?

Query: "cancer cure"

YouTube: "beet juice cures cancer in 48 hours"

(74% of Spanish-language YouTube results are of this kind)

What's wrong with this system?

Nothing is wrong with the systems...

The YouTube algorithm is asked to maximize viewing time... and it delivers

The Google algorithm is asked to maximize user satisfaction... and it does

The problem is in the optimization function (the quality metric!)

Your soul for 7 wishes

Wish: rich and powerful and married to Alison -> a Colombian drug lord hated by his wife Alison

Wish: emotionally sensitive -> crying all the time; Alison leaves him for a strong, rude man

Wish: a woman-magnet superstar athlete -> that, with low IQ and a small penis

Wish: intelligent, witty and well-endowed -> a famous writer... who is gay

Lesson 2

If a Machine Learning algorithm grants you a wish, keep calm and think twice

System Biases (around the ML core): Algorithmic Bias, Data Bias, Evaluation Bias

Lesson 1: It's the evaluation, stupid

Incentives: private vs public research

Private: incentives = user satisfaction, attention retention (70% of all research)

Public: incentives = sheer curiosity, publishability (70% of all publications)

Outcomes: IRREPRODUCIBLE, IRRELEVANT

big data?

Lesson 3

Always remember why you wanted to do research

COMPAS: predicting the risk of future criminal activity

White people: 1 out of 7 false positives

Black people: 2 out of 4 false positives (HIGHER FP RATE)

COMPAS bias: wait... Predictive Parity?

White people: precision 2/3, 1 out of 7 false positives

Black people: precision 4/6, 2 out of 4 false positives (HIGHER FP RATE)

EQUAL PRECISION (PREDICTIVE PARITY), yet a much higher false-positive rate for black people
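To make the arithmetic concrete, here is a minimal sketch using the toy counts implied by the slide (not the real COMPAS figures): both groups get the same precision, yet their false-positive rates differ widely.

# Toy confusion-matrix counts consistent with the slide (not the real COMPAS data).
# "Flagged" = predicted high risk; tp/fp/tn = true positives, false positives, true negatives.
groups = {
    "group with 1 of 7 innocents flagged": {"tp": 2, "fp": 1, "tn": 6},
    "group with 2 of 4 innocents flagged": {"tp": 4, "fp": 2, "tn": 2},
}

for name, c in groups.items():
    precision = c["tp"] / (c["tp"] + c["fp"])   # what predictive parity compares
    fp_rate = c["fp"] / (c["fp"] + c["tn"])     # what disparate error rates compare
    print(f"{name}: precision = {precision:.2f}, FP rate = {fp_rate:.2f}")

# Both groups: precision 0.67 (predictive parity holds),
# but FP rates are 0.14 vs 0.50: the innocent are punished far more often in one group.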

Nature 558, 357-360 (2018)

It's beyond statistics!

"Treating equals equally" (predictive parity)

vs

"The innocent should not be punished"

(unless they are black?)

Mark MacCarthy: Measures of Algorithmic Fairness Move Beyond Predictive Parity to Focus on Disparate Error Rates

Two Myths

Yuval Noah Harari, "The Myth of Freedom" (The Guardian, 14/07/2018)

- Free will

- "It's not the tool, it's what we do with the tool"... Maybe not!

Lesson 4

Connect with Social Sciences & Humanities

Recommended lecture: Joe Edelman, "Is Anything Worth Maximizing?"

Bias in System Evaluation

Proper evaluation should be

  • Valid (meaningful, unbiased)
  • Replicable
  • Generalizable (valid predictions)

but....

Prediction is very difficult, especially about the future (Niels Bohr, physicist)

Replication is very difficult, especially after the first occurrence

(Stefano Mizzaro, data scientist)

Hard to predict performance for new problems, for similar problems, even for identical problems.

The bridge metaphor in CS

Lesson 5

Be an open-minded researcher & reviewer! Avoid niches.

Recommended Reading

METRICS

Evaluation Metrics: Task -> Metrics

Thermodynamics 101: Task -> Metrics

Recommendation biased by Metric

Netflix challenge: predict user ratings (classification)

Real task: find something the user will like (ranking)
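A hypothetical illustration of the mismatch (item names and ratings are invented): system A minimizes rating-prediction error, system B gets the ordering right, and the two metrics disagree about which one is better.

import math

# Invented ratings: what the user really thinks of four items.
true_ratings = {"i1": 5, "i2": 4, "i3": 2, "i4": 1}
pred_A = {"i1": 2.9, "i2": 3.1, "i3": 3.0, "i4": 2.8}  # hugs the mean: low RMSE, scrambled order
pred_B = {"i1": 9.0, "i2": 7.0, "i3": 3.0, "i4": 1.0}  # wild magnitudes, but perfect order

def rmse(pred):
    return math.sqrt(sum((pred[i] - true_ratings[i]) ** 2 for i in pred) / len(pred))

def precision_at_2(pred, relevant_threshold=4):
    top2 = sorted(pred, key=pred.get, reverse=True)[:2]
    return sum(true_ratings[i] >= relevant_threshold for i in top2) / 2

for name, pred in [("system A", pred_A), ("system B", pred_B)]:
    print(f"{name}: RMSE = {rmse(pred):.2f}, P@2 = {precision_at_2(pred):.1f}")

# Rating prediction (RMSE) prefers system A; the real task (ranking) prefers system B.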

Filtering metrics correlate little with one another

How they evaluate non-informative output is key

Task + Scenario -> Metric

TASK: a text contains "orange": does it refer to the telecom?

SCENARIOS:

  • Show ads about Vodafone (absolute gain/loss per item)
  • Estimate Orange's online presence (non-informative = useless)
  • Reputation monitoring for PR (if in doubt, keep it)

Amigó et al. 2019, Information Retrieval Journal

Failure example at UNED

Topic Detection for Online Reputation Management:

RQ1: Can we learn similarity functions from annotated data?

RQ2: Can semantic signals be used effectively?

Yes and Yes --> SIGIR full paper

Success? No... Scenario!!!!

Lesson 6

Start with the metrics, don't settle for popular choices, and don't stop until you've truly captured task + scenario for your problem (iteration & updates will probably be needed)

Reminder (Lesson 2): you are dancing with the devil

Back to RecSys: OK, OK, so let's go with ranking

Which metric?

  • Use the most popular (comparability)
  • Use the simplest (transparency)
  • Get creative (nonconformism)
  • Mirror, mirror, who is the prettiest?

How many ranking metrics satisfy five intuitive formal constraints?


Amigó et al, SIGIR 2013

None!

Rank-Biased Precision gets close

confidence constraint

returning nothing is better than returning bullshit

<
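For reference, a minimal sketch of Rank-Biased Precision (Moffat & Zobel, ACM TOIS 2008): relevance near the top is weighted by a user-persistence parameter p.

def rbp(relevance, p=0.8):
    """relevance: list of 0/1 judgements in rank order; p: probability that the
    user keeps inspecting the next result."""
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(relevance))

print(rbp([1, 1, 0, 1, 0]))  # relevant items concentrated at the top -> about 0.46
print(rbp([0, 1, 0, 1, 1]))  # same relevant items, placed lower      -> about 0.34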

Failure example at UNED: clustering task

ranks 2nd!

Amigó et al. 2009, Information Retrieval Journal

- Four intuitive constraints

- Only BCubed Precision & Recall satisfy all constraints

- Yes, it solves our initial problem :-)
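A minimal sketch of BCubed precision and recall as defined in Amigó et al. 2009: for every item, compare its cluster with its gold category, then average over items. The example clustering below is invented.

def bcubed(clusters, gold):
    """clusters, gold: dicts mapping each item to a cluster id / gold category."""
    items = list(clusters)
    precision = recall = 0.0
    for e in items:
        same_cluster = [x for x in items if clusters[x] == clusters[e]]
        same_category = [x for x in items if gold[x] == gold[e]]
        correct = sum(1 for x in same_cluster if gold[x] == gold[e])
        precision += correct / len(same_cluster)
        recall += sum(1 for x in same_category if clusters[x] == clusters[e]) / len(same_category)
    return precision / len(items), recall / len(items)

clusters = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}            # system output
gold     = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y"}  # reference categories
p, r = bcubed(clusters, gold)
print(f"BCubed precision = {p:.2f}, recall = {r:.2f}")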

Lesson Observation

Research on Evaluation stands the test of time better

Lesson Hint

If in doubt when choosing a metric, a starting point is

  • ranking: Rank-Biased Precision
  • clustering: BCubed
  • filtering: how do you want non-informative systems to be evaluated?

What is this task? A reviewer assigns each paper one of: accept, leaning to accept, leaning to reject, reject

No specific metric for ordinal classification!!!

(people either ignore the order or assume distances between the classes)
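A toy comparison with invented review decisions: accuracy ignores the order of the classes entirely, while mean absolute error over a numeric mapping distinguishes near misses from extreme flips but assumes the classes are equidistant.

# The ordinal scale, mapped to integers only for the sake of the sketch.
scale = {"reject": 0, "leaning to reject": 1, "leaning to accept": 2, "accept": 3}

gold  = ["accept", "reject", "leaning to accept", "leaning to reject"]
sys_1 = ["leaning to accept", "leaning to reject", "leaning to accept", "leaning to reject"]  # near misses
sys_2 = ["reject", "accept", "leaning to accept", "leaning to reject"]                        # extreme flips

def accuracy(pred):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def mae(pred):
    return sum(abs(scale[p] - scale[g]) for p, g in zip(pred, gold)) / len(gold)

for name, pred in [("system 1", sys_1), ("system 2", sys_2)]:
    print(f"{name}: accuracy = {accuracy(pred):.2f}, MAE = {mae(pred):.2f}")

# Accuracy rates both systems identically (it ignores the order);
# MAE separates them, but only by assuming equal distances between classes.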

What about multiple quality dimensions? (F1, for instance, combines precision and recall)

Lesson 6

Start with the metrics.

Your metrics define your problem. They can be your GPS or your kryptonite

(Lesson 4b: Yes, it can be difficult!)

"Data is not the new oil; it's the new plutonium"

DATA

ACQUIRING EVALUATION DATA

callingbullshit.org

Wisdom of the crowds? Only if there is:

  • diversity
  • independence
  • decentralization

Harvesting: Bias in the Web (Baeza-Yates)

  • second-order bias
  • manipulated crowds

Selection: RecSys example

Cañamares & Castells, SIGIR 2018: fill the gaps

random sampling = popularity bias

Selection: Cañamares & Castells, SIGIR 2018 (best paper award)

lab vs real-world performance

"Hate Speech Detection is not as easy as you may think"

  • Social platforms struggle with hate speech, yet SotA reports claim 93% success?????
  • The authors re-evaluate the best approaches (WWW 2017 & ECIR 2018), avoiding data overfitting
  • Outcome: performance drops from above 90% to below 80%
  • Cross-dataset: BELOW 50% F1 even in the same domain!!!
  • Why? Dataset bias: in Waseem & Hovy, 65% of hate messages come from just two users

(Arango et al, SIGIR 2019)
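A quick check for this kind of bias before trusting a benchmark, sketched with a hypothetical file and column names (not the actual Waseem & Hovy release): measure how concentrated the positive class is across authors.

import pandas as pd

# Hypothetical file and schema; the real dataset will differ.
df = pd.read_csv("hate_speech_tweets.csv")        # expected columns: user, text, label
hate = df[df["label"] == "hate"]

share_per_user = hate["user"].value_counts(normalize=True)
print("share of hate tweets written by the top 2 users:", share_per_user.head(2).sum())

# If a couple of authors dominate the positive class, a classifier can score well by
# modelling those authors' writing style rather than hate speech, and cross-dataset F1 collapses.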

Annotation Example:

Textual Entailment

(Gururangan et al, NAACL 2018)

Annotation artifacts: "contradiction" cued by negation, "neutral" cued by purpose clauses; the label can be learned without looking at the premise!

Geva et al. 2019: Are we modeling the task or the annotator?

Tasks: Inference, Common Sense QA

Observation: using the annotator id as a feature improves results by up to 8%
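A rough sketch of the Geva et al.-style check on purely synthetic data (the features, the number of annotators, and the effect size below are invented): train the same model with and without the annotator id and compare.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 2000
content = rng.normal(size=(n, 5))                  # stand-in for real content features
annotator = rng.integers(0, 10, size=n)            # 10 annotators
bias = np.linspace(-1.5, 1.5, 10)[annotator]       # each annotator has their own labelling bias
labels = (content[:, 0] + bias + rng.normal(scale=0.5, size=n) > 0).astype(int)

onehot = OneHotEncoder().fit_transform(annotator.reshape(-1, 1)).toarray()
with_id = np.hstack([content, onehot])

for name, X in [("content only", content), ("content + annotator id", with_id)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: accuracy = {acc:.3f}")

# If the annotator id helps noticeably, part of what the model learns is the annotator, not the task.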

Lesson 7:

Assume that your data, your data selection and your data annotations are biased. Find the bias!


EXPERIMENTAL DESIGN

A/B testing issues

  • Weak baselines ("improvements that don't add up")
  • Overfitting
  • Difficult to link an empirical result to the research question (too many parameters)
  • Difficult to compare with state of the art
  • Replicability (see CENTRE): system outputs are usually not available
  • Lab experimentation often ignores real-world use cases

Weak Baselines: IR (SIGIR 2019)

Yang et al., SIGIR 2019: Is Neural Hype Justified?

Weak baselines: RecSys (RecSys 2019)

Neural Collaborative Filtering (WWW 2017) is established baseline

BUT it is worse than well tuned simple methods

Other DL methods compare to Neural CF as baseline

Finding: 6 out of 7 top DL methods are worse than a simple kNN baseline

Ferrari Dacrema et al., RecSys 2019: Are We Really Making Much Progress?

Leaderboards: the case of NLP & GLUE

- Contextual Word Embeddings: powerful semantic & syntactic representations

- Solve many tasks at once! ("pre-training")

- Sesame Street & Transformers crossover

- GLUE leaderboard

- Twitter-fast state of the art

GLUE Leaderboard

Advantages:

- state-of-the-art baselines

- less overfitting

- generalizability

Disadvantages:

- data is buried

- task nuances are buried

- qualitative analysis is buried

- publication system collapses!

Too fast!

EvAll online evaluation service

Low cost of entry:

  • evaluation in 4 clicks, no registration required
  • universality: only requires runs & assessments

Benefits:

  • full written report (PDF, LaTeX): metric details, suitability, statistical significance
  • tasks & systems repository

Lesson 8

Always try to use the state of the art as a baseline

Also remember

2: keep calm and drink mojito

5: be a reasonable reviewer

ANALYSIS

Analysis and "publishability" bias

  • unanimous improvement
  • biased improvement
  • predicting results in a different test collection

Unanimous Improvement Ratio

  • Sensitivity versus reliability
  • Finding versus explaining differences
  • Averages that hide behavior (across test cases, classes, metrics)
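A toy sketch of the unanimous-improvement idea mentioned above (my reading; see Amigó et al. for the exact UIR definition): credit a system on a test case only when it is at least as good on every metric, and report the normalized difference. The metric names and scores are invented.

# Invented per-test-case scores for two systems on two metrics.
A = [{"P@10": 0.50, "MAP": 0.40}, {"P@10": 0.60, "MAP": 0.55}, {"P@10": 0.30, "MAP": 0.45}]
B = [{"P@10": 0.40, "MAP": 0.42}, {"P@10": 0.50, "MAP": 0.50}, {"P@10": 0.35, "MAP": 0.30}]

def unanimous(x, y):
    """x improves on y only if it is >= on every metric and > on at least one."""
    return all(x[m] >= y[m] for m in x) and any(x[m] > y[m] for m in x)

a_wins = sum(unanimous(a, b) for a, b in zip(A, B))
b_wins = sum(unanimous(b, a) for a, b in zip(A, B))
print("unanimous improvement ratio of A over B:", (a_wins - b_wins) / len(A))
# Only 1 of the 3 test cases is a unanimous improvement; the other 2 are metric-dependent.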

- Macro-average F1: arithmetic average (over classes) of the harmonic average of P,R for each class

- and then GLUE averages this with the results of n other tasks
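For concreteness, a small sketch of that chain of averages (all numbers invented): per-class F1 is a harmonic mean of P and R, macro-F1 is an arithmetic mean over classes, and a GLUE-style leaderboard then averages again over tasks, so very different behaviours can share one headline number.

def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Invented per-class precision/recall for one task: strong on class A, collapsed on class B.
per_class = {"A": (0.95, 0.90), "B": (0.20, 0.10)}
macro_f1 = sum(f1(p, r) for p, r in per_class.values()) / len(per_class)
print("macro-F1:", round(macro_f1, 3))   # about 0.53, hiding the collapse on class B

# ...and a leaderboard then averages it with (invented) scores from other tasks:
task_scores = [macro_f1, 0.81, 0.77]
print("leaderboard average:", round(sum(task_scores) / len(task_scores), 3))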

"Empirical studies have become challenges to be won, rather than a process for developing insight and understanding"

Lesson 9

Remember it is all about insights, understanding and generalizability, not about winning.

And please go beyond averages!

What to do?

In summary

1 It's the evaluation, stupid

2 If a ML algorithm grants you a wish, keep calm and think twice

3 Always remember why you wanted to be a researcher

4 Connect with Social Sciences & Humanities

5 Be an open minded researcher & reviewer! Avoid niches

6 Start with the metrics, don't settle for the popular choices, and don't stop until you have truly captured task + scenario for your problem

(Hint) If in doubt when choosing a metric, a starting point is RBP for ranking, BCubed for clustering, and thinking about non-informative cases for filtering

7: Assume that your data, your data selection and your data annotations are biased. Find the bias!

8: Always try to use the state of the art as baseline

9: Remember it is all about insights, understanding and generalizability, not about winning. And PLEASE go beyond averages!

"Religion is the culture of faith. Science is the culture of doubt" (R. Feynman)

Change incentives

  • Encourage publication of:
    • negative results
    • test collection building
    • replication results
  • Reject blind application of ML package X to problem Y

Change evaluation focus

  • Focus on reliable predictions, not sensitive outcomes
  • Explain rather than find differences
  • Go beyond averages
  • Measurement theory & metric curation
  • Publishing procedure beyond text
  • ...