Query: "cancer cure"
YouTube: "beet juice cures cancer in 48 hours"
(74% of YouTube results in Spanish make similar claims)
YouTube's algorithm is asked to maximize viewing time... and it delivers.
Google's algorithm is asked to maximize user satisfaction... and it does.
The problem is in the optimization function (the quality metric!)
The wish → what you get:
- "be rich and powerful and married to Alison" → Colombian drug lord hated by his wife Alison
- "also emotionally sensitive" → crying all the time; Alison leaves him for a strong, rude man
- "woman magnet super star athlete" → that, with low IQ and small penis
- "intelligent, witty and well-endowed" → famous writer... gay
If a Machine Learning algorithm grants you a wish, keep calm and think twice
Algorithmic Bias: Data Bias → ML → Evaluation Bias
Incentives:
- Private: user satisfaction, attention retention
- Public: sheer curiosity, publishability
Outcomes:
- 70% of all research: IRREPRODUCIBLE
- 70% of all publications: IRRELEVANT
big data?
Always remember why you wanted to do research
white people: 1 out of 7 false positives; 2/3 precision
black people: 2 out of 4 false positives; 4/6 precision
EQUAL PRECISION (PREDICTIVE PARITY), yet a HIGHER FP RATE for black people
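A minimal sketch of the arithmetic behind the toy example above, assuming for illustration that no guilty person goes unflagged (no false negatives); the counts mirror the slide:

```python
# Equal precision (predictive parity) can coexist with very
# different false-positive rates.
def rates(tp, fp, tn):
    precision = tp / (tp + fp)   # of those flagged, how many were right
    fp_rate = fp / (fp + tn)     # of the innocent, how many were flagged
    return precision, fp_rate

# Counts from the slide: white = 1 FP out of 7 negatives, precision 2/3;
# black = 2 FPs out of 4 negatives, precision 4/6.
white = rates(tp=2, fp=1, tn=6)
black = rates(tp=4, fp=2, tn=2)

print(f"white: precision={white[0]:.2f}, FP rate={white[1]:.2f}")
print(f"black: precision={black[0]:.2f}, FP rate={black[1]:.2f}")
# -> equal precision (0.67 vs 0.67), but FP rate 0.14 vs 0.50
```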
Nature 558, 357-360 (2018)
"Treating equals equally" (predictive parity)
vs
"The innocent should not be punished"
(unless they are black?)
Mark MacCarthy: Measures of Algorithmic Fairness Move Beyond Predictive Parity to Focus on Disparate Error Rates
Yuval Noah Harari, "The Myth of Freedom" (The Guardian, 14/07/2018)
- Free will
- "It's not the tool, it's what we do with the tool"... Maybe not!
Connect with Social Sciences & Humanities
Recommended lecture: Joe Edelman, "Is Anything Worth Maximizing?"
"Prediction is very difficult, especially about the future" (Niels Bohr, physicist)
"Replication is very difficult, especially after the first occurrence" (Stefano Mizzaro, data scientist)
Hard to predict performance for new problems, for similar problems, even for identical problems.
Be an open-minded researcher & reviewer! Avoid niches.
Thermodynamics 101
Netflix challenge:
predict user ratings
(classification)
Real task:
find something the user will like
(ranking)
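A toy illustration (invented numbers) of why the metric and the task can disagree: the model with the better rating-prediction error can still recommend the wrong item.

```python
import math

true_ratings = [5, 4, 1, 1]        # the user's actual preferences
pred_a = [3.4, 3.5, 1.0, 1.0]      # low error, but swaps the top two items
pred_b = [4.0, 2.0, 1.0, 1.0]      # higher error, correct order

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

for name, pred in [("A", pred_a), ("B", pred_b)]:
    top = max(range(len(pred)), key=lambda i: pred[i])
    print(f"model {name}: RMSE={rmse(pred, true_ratings):.3f}, "
          f"recommends item {top} (the user's favourite is item 0)")
# -> A wins on RMSE but recommends the wrong item;
#    B loses on RMSE but gets the ranking right.
```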
TASK: text contains "orange": does it refer to the telecom?
SCENARIOS:
- Show ads about Vodafone (absolute gain/loss per item)
- Estimate Orange's online presence (non-informative = useless)
- Reputation monitoring for PR (if in doubt, keep it)
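A sketch of how the same two hypothetical filters score under each scenario; all counts and the gain/loss values are invented for illustration:

```python
# Two hypothetical "orange -> telecom?" filters, as confusion-matrix counts.
systems = {
    "strict":  dict(tp=40, fp=5,  fn=30, tn=925),   # keeps little, high precision
    "lenient": dict(tp=65, fp=60, fn=5,  tn=870),   # keeps a lot, high recall
}

def ad_utility(c, gain=1.0, loss=2.0):
    # Scenario 1: absolute gain/loss per item shown
    return gain * c["tp"] - loss * c["fp"]

def presence_error(c):
    # Scenario 2: how far off is the estimated share of telecom mentions?
    n = sum(c.values())
    return abs((c["tp"] + c["fp"]) / n - (c["tp"] + c["fn"]) / n)

def recall(c):
    # Scenario 3: reputation monitoring -- if in doubt, keep it
    return c["tp"] / (c["tp"] + c["fn"])

print("system   ad_utility  presence_err  recall")
for name, c in systems.items():
    print(f"{name:8} {ad_utility(c):10.1f} {presence_error(c):12.3f} "
          f"{recall(c):7.2f}")
```

The "best" system flips with the scenario: utility and presence estimation favour the strict filter, while reputation monitoring favours the lenient one.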
Amigó et al. 2019, Information Retrieval Journal
Topic Detection for Online Reputation Management:
RQ1: Can we learn similarity functions from annotated data?
RQ2: Can semantic signals be used effectively?
Yes and yes --> SIGIR full paper
Success? No... we had missed the scenario!
Start with the metrics, don't settle for popular choices, and don't stop until you've truly captured task + scenario for your problem (iteration & update probably needed)
Reminder of Lesson 2: you are dancing with the devil
Amigó et al., SIGIR 2013
Confidence constraint: returning nothing is better than returning bullshit
(yet a system returning nothing can rank 2nd!)
- Four intuitive constraints
- Only BCubed Precision & Recall satisfy all constraints
- Yes, it solves our initial problem :-)
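For reference, a minimal BCubed Precision/Recall sketch (per-item precision and recall averaged over all items, following Amigó et al. 2009); the tiny example is invented:

```python
def bcubed(clusters, gold):
    """BCubed P/R sketch: clusters and gold map each item to a label."""
    items = list(clusters)
    p_sum = r_sum = 0.0
    for e in items:
        same_cluster = [x for x in items if clusters[x] == clusters[e]]
        correct = sum(1 for x in same_cluster if gold[x] == gold[e])
        p_sum += correct / len(same_cluster)          # item precision
        cat_size = sum(1 for x in items if gold[x] == gold[e])
        r_sum += correct / cat_size                   # item recall
    n = len(items)
    return p_sum / n, r_sum / n

# Tiny example: the system lumps everything into one cluster.
system = {"a": 1, "b": 1, "c": 1, "d": 1}
gold   = {"a": "x", "b": "x", "c": "y", "d": "y"}
print(bcubed(system, gold))   # -> (0.5, 1.0): perfect recall, precision 0.5
```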
Research on Evaluation stands the test of time better
If in doubt when choosing a metric, a starting point is:
- ranking: Rank-Biased Precision
- clustering: BCubed
- filtering: how do you want non-informative systems to be evaluated?
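A minimal Rank-Biased Precision sketch (Moffat & Zobel, 2008): the user moves one rank deeper with persistence probability p, so RBP = (1 - p) * sum_i r_i * p^(i-1); the relevance lists are invented:

```python
def rbp(rels, p=0.8):
    """rels: binary relevance judgements down the ranking; p: persistence."""
    return (1 - p) * sum(r * p**i for i, r in enumerate(rels))

print(rbp([1, 1, 0, 1, 0]))   # 0.462: rewards relevance near the top
print(rbp([0, 1, 1, 0, 1]))   # 0.370: same relevant count, lower score
```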
Example: a reviewer rates a paper on an ordinal scale: accept / leaning to accept / leaning to reject / reject.
No specific metric for ordinal classification!
(people use F1, which ignores the rank, or assume distances between labels)
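A sketch of the problem: macro-F1 gives an off-by-one mistake and an accept-vs-reject catastrophe exactly the same score, while an order-aware measure (here, simply MAE over label ranks) separates them; labels and predictions are invented:

```python
# Labels: 0=accept, 1=leaning accept, 2=leaning reject, 3=reject.
def macro_f1(true, pred, n_classes=4):
    f1s = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_classes

def mae(true, pred):
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

true = [0, 1, 2, 3]
near = [1, 1, 2, 3]   # one off-by-one mistake
far  = [3, 1, 2, 3]   # one accept confused with reject
print(macro_f1(true, near), mae(true, near))   # 0.67, 0.25
print(macro_f1(true, far),  mae(true, far))    # 0.67, 0.75
```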
Start with the metrics.
Your metrics define your problem. They can be your GPS or your kryptonite
(Lesson 4b: Yes, it can be difficult!)
"Data is not the new oil; it's the new plutonium"
callingbullshit.org
Only if there is:
- diversity
- independence
- decentralization
Bias in the Web (Baeza-Yates): second-order bias, manipulated crowds
Cañamares & Castells, SIGIR 2018 (best paper award): fill the gaps
random sampling = popularity bias
Annotation artifacts:
- Gururangan et al., NAACL 2018: in NLI data, contradiction via negation, neutral via purpose clauses: labels can be learned without looking at the premise!
- (see also Arango et al., SIGIR 2019)
- Geva et al. 2019: Are we modeling the task or the annotator?
  Tasks: Inference, Common Sense QA
  Observation: using annotator id as feature improves up to +8%
Lesson 7: Assume that your data, your data selection and your data annotations are biased. Find the bias!
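A sketch of the artifact check in the spirit of Gururangan et al.: train a classifier that sees only the hypothesis and compare it with the majority baseline; the mini-dataset below is invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypotheses only; no premise is ever shown to the model.
hypotheses = [
    "The man is not sleeping", "Nobody is outside",         # contradiction
    "She is travelling to work", "He runs to catch a bus",  # neutral (purpose)
    "A person is outdoors", "Someone is eating",            # entailment
] * 10
labels = (["contradiction"] * 2 + ["neutral"] * 2 + ["entailment"] * 2) * 10

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, hypotheses, labels, cv=5)
print("hypothesis-only accuracy:", scores.mean())
# Accuracy well above 1/3 means the labels leak through annotation artifacts.
```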
Yang et al., SIGIR 2019: Is Neural Hype Justified?
Ferrari Dacrema et al., RecSys 2019: Are We Really Making Much Progress?
- Neural Collaborative Filtering (WWW 2017) is the established baseline, BUT it is worse than well-tuned simple methods
- Other DL methods compare to Neural CF as baseline
- Finding: 6 out of 7 top DL methods are worse than a simple kNN baseline
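For context, a minimal item-kNN recommender of the "simple baseline" kind; the ratings matrix is invented:

```python
import numpy as np

R = np.array([[5, 3, 0, 1],      # users x items, 0 = unrated
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(R, axis=0, keepdims=True)
S = (R.T @ R) / (norms.T @ norms + 1e-9)
np.fill_diagonal(S, 0.0)

# Predicted score: similarity-weighted sum of the user's known ratings.
scores = R @ S
scores[R > 0] = -np.inf          # never re-recommend already-seen items
print("recommend item", scores[0].argmax(), "to user 0")
```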
- Contextual Word Embeddings: powerful semantic & syntactic representations
- Solve many tasks at once! ("pre-training")
- Sesame Street & Transformers crossover
- GLUE leaderboard
- Twitter-fast state of the art
Advantages:
- state-of-the-art baselines
- less overfitting
- generalizability
Disadvantages:
- data is buried
- task nuances are buried
- qualitative analysis is buried
- publication system collapses!
- cost of entry: evaluation in 4 clicks, no registration required
- universality: only requires runs & assessments
+ benefits:
  - full written report (pdf, LaTeX)
  - metric details, suitability, statistical relevance
  - tasks & systems repository
Always try to use the state of the art as baseline
Also remember
2: keep calm and drink mojito
5: be a reasonable reviewer
unanimous improvement vs. biased improvement: predicting results in a different test collection
Unanimous Improvement Ratio
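A sketch of the idea, under the assumption that UIR is the normalized difference between test cases where system A is at least as good as B on every metric and the cases where the reverse holds (see Amigó et al. for the exact definition); numbers are invented:

```python
def uir(metrics_a, metrics_b):
    """metrics_a/b: per-test-case metric tuples, e.g. (P, R), one per case."""
    n = len(metrics_a)
    a_wins = sum(all(x >= y for x, y in zip(a, b))     # A >= B on every metric
                 for a, b in zip(metrics_a, metrics_b))
    b_wins = sum(all(y >= x for x, y in zip(a, b))     # B >= A on every metric
                 for a, b in zip(metrics_a, metrics_b))
    return (a_wins - b_wins) / n

a = [(0.8, 0.6), (0.7, 0.7), (0.9, 0.5)]   # (precision, recall) per test case
b = [(0.7, 0.5), (0.8, 0.6), (0.6, 0.6)]
print(uir(a, b))   # 0.33: positive regardless of how P and R are weighted
```

An improvement that holds no matter how the individual metrics are weighted is more likely to survive a move to a different test collection.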
- Macro-averaged F1: the arithmetic average (over classes) of the harmonic mean of P and R for each class
- ...and then GLUE averages this with the results from the n other tasks
"Empirical studies have become challenges to be won, rather than a process for developing insight and understanding"
Remember it is all about insights, understanding and generalizability, not about winning.
And please go beyond averages!
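A tiny sketch of why: two systems with identical average scores (invented numbers) can behave in radically different ways per topic:

```python
import statistics

per_topic_a = [0.55, 0.54, 0.56, 0.55, 0.55]   # consistently mediocre
per_topic_b = [0.95, 0.15, 0.95, 0.15, 0.55]   # brilliant or broken

for name, s in [("A", per_topic_a), ("B", per_topic_b)]:
    print(name, round(statistics.mean(s), 2), round(statistics.stdev(s), 2),
          min(s), max(s))
# Identical means (0.55), radically different risk profiles.
```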
1: It's the evaluation, stupid
2: If an ML algorithm grants you a wish, keep calm and think twice
3: Always remember why you wanted to be a researcher
4: Connect with Social Sciences & Humanities
5: Be an open-minded researcher & reviewer! Avoid niches
6: Start with the metrics, don't settle for the popular choices, and don't stop until you have truly captured task + scenario for your problem
(Hint) If in doubt when choosing a metric, a starting point is RBP for ranking, BCubed for clustering, and thinking about non-informative cases for filtering
7: Assume that your data, your data selection and your data annotations are biased. Find the bias!
8: Always try to use the state of the art as baseline
9: Remember it is all about insights, understanding and generalizability, not about winning. And PLEASE go beyond averages!
"Religion is the culture of faith. Science is the culture of doubt" (R. Feynman)