
# One metric to rule them all: A general Evaluation Measure for Information Access Tasks

Presentation by Enrique Amigó and Julio Gonzalo at Google Zurich, 18 December 2012

by Julio Gonzalo, 21 December 2012

#### Transcript of One metric to rule them all: A general Evaluation Measure for Information Access Tasks

Notation: input, output; topic 1, 2, 3 ... less important; discarded. More priority in the system output; more priority in the gold standard; they share a cluster in the system output; they share a cluster in the gold standard.

General task: ranking, filtering, clustering. Includes overlapping clustering, document ranking and document filtering.

Clustering related docs. is good. Clustering unrelated docs. is bad. Noise is better with noise (the rag bag idea). Breaking one cluster is better than breaking n clusters.

Priority constraint, deepness constraint, deepness threshold constraint, closeness threshold constraint: there is a ranking area which is always explored, and a ranking area which is never explored.

Reliability and Sensitivity. How many relevant documents are in the top of the ranking?

Definition: for each rank position, the weight for document i is the integral of 1/x^2 from i-1 to i. The sum of the weights should be 1 (c1 = c). Parameterization: the first n documents carry a fraction Wn of the weight — two parameters. We can integrate R and S over the long tail of documents.

Overlapping clustering: similar to the extended Bcubed (Amigó et al. 2009). Sharing too many clusters decreases R; sharing too few clusters decreases S.

Reliability and Sensitivity map into Bcubed measures. High F(R,S) => high score with all measures: a high F(R,S) ensures a high score according to any measure! In filtering, Reliability/Sensitivity map into:

R -> Precision(Neg class) * Precision(Pos class)
S -> Recall(Neg class) * Recall(Pos class)

Giving more weight to the first rank positions is less robust (less data) but more strict (it covers P@10).

System output, evaluation measure. Prioritizing a relevant document over an irrelevant document increases the score. The sum of all document weights should be 1: the integral of a convergent function. We want to say that the top n documents carry e.g. 80% of the total weight (rank position, c1, c). And the weight of d is: depending on this parameter, one relevant doc is better than n relevant docs after n irrelevant docs.

Breaking a cluster with higher priority; breaking a cluster; removing one doc. instance; adding a doc relationship.

Now think big... imagine we find a single optimal measure for document retrieval. Now think bigger... document retrieval, document clustering, document filtering: document organization. Imagine we find a single optimal measure for all these tasks... and their combinations! A measure that is simple, intuitive, sound... and strict with respect to all other metrics.

Motivation: clustering & formal constraints; the need to evaluate mixed problems. "I'm ill, could you do my work?" "Of course." "I have organized all the information for you." "OK, thanks." ... "I have organized all the information for you." "DON'T DO THAT!!"

THERE IS NO UNIQUE, OBJECTIVE GROUND TRUTH FOR ORGANIZING DOCUMENTS!!! We cannot ensure that one human-produced organization is better than another... but by resembling one expert we increase the probability of helping other experts. Similarity to a gold standard = quality?

| | Completeness | Homogeneity | Rag bag | Cluster size vs. quantity |
| --- | --- | --- | --- | --- |
| Purity / inv. purity, clustering F | OK | FAIL | OK | FAIL |
| Rand, Jaccard, F&M | OK | OK | FAIL | FAIL |
| Entropy, Mutual Information, VI | OK | OK | OK | FAIL |
| Edit distance | FAIL | OK | FAIL | OK |
| Bcubed | OK | OK | OK | OK? |

Documents at the top of the ranking have more weight. More than 100 different measures. More than ordering documents: an ocean of documents, an ocean of discarded documents, limited oxygen. Top documents, full ranking, publication date.

How clean are the clusters? How many documents are correctly classified? To what extent is a document correctly related with the rest?

Reliability: precision of the relationships in the system output. Sensitivity: recall of the relationships stated in the gold standard. (System output, gold, relationships.)

Document weighting: we compute the weighted average of precision over all documents. We compute the conditional probability as a sum of products of probabilities, in terms of document weights. We need to compute this: is a relationship in X reflected in G? If d and d' are related SEVERAL times in G and X, then we do not know the relationship matching. In overlapping clustering, d and d' can be related several times in G or X. If we assume the best matching (worked example over documents d1, d2, d3 with overlapping system output and gold clusterings): P(r(d1,d2) in G) = 1, P(r(d1,d2) in X) = 1/2. It fails in the case of overlapping clustering: system A breaks one cluster, system B breaks three clusters. The trick: R and S compute each document instance separately.

(Constraint illustrations: top documents, full ranking, publication date — FAIL.) Confidence constraint: systems should know when they do not know. FAIL.

| | Deepness | Priority | Deepness threshold | Closeness threshold | Confidence constraint |
| --- | --- | --- | --- | --- | --- |
| P@n, RR | OK | FAIL | OK | FAIL | FAIL |
| Kendall, AUC | OK | OK | FAIL | FAIL | FAIL |
| MRR, Bpref | OK | OK | OK | FAIL | FAIL |
| MAP, DCG, Q-measure | FAIL | OK | FAIL | OK | FAIL |
| RBP | OK | OK | OK | OK | FAIL |
| R&S | OK | OK | OK | OK | OK |

Three measure families (plot: score vs. amount of randomly returned documents, with curves for Reliability, F measure; Lam, Mutual Information, Chi coefficient...; utility, weighted accuracy): class-oriented measures, utility measures, informativeness measures. Reliability and sensitivity do not satisfy any of them. Reliability*Sensitivity is the most robust measure and the strictest measure.

Document organization. Evaluation campaigns, beware of your power! Pick up a measure and then... the measure becomes standard because people use it; people use it because it is standard. Popular strategy number 2: get creative. The Hilton-Chihuahua metric effect! A vicious circle! The MT evaluation mess. Why? Hard sciences: temperature is what a thermometer measures. Information access: document retrieval is what MAP measures? (MAP, MRR, P@10, nDCG.) The measure defines the problem: an adequate measure is a fundamental step towards solving a problem; an inadequate measure implies focusing on the wrong problem. Popular strategy number 1: use the most popular metric. Why evaluation metrics? State of the art. Popular strategy number 3: use the simplest. Popular strategy number 4: use the one that says what I want to hear.

Idea: formal constraints. We may not be able to prescribe how to design the evaluation function... but we know what the function has to do in certain simple boundary conditions. Any suitable metric has to satisfy them. Bcubed precision/recall, multiplicity constraint, extended Bcubed. Web People Search: a metric problem, the task, the evaluation campaign. Idea: formal constraints. Do other humans agree? 92%, 90%, 95%, 100%. Yes! Analysis. And the winner is... :-) problem solved! :-( How? OK... but what? ...Well, the moon! :-) Yes... a single metric to rule them all!

One Metric to Rule Them All: A General Evaluation Measure for Information Access Tasks. Enrique Amigó and Julio Gonzalo, nlp.uned.es.

Formal constraints, new measures, clustering, filtering, gold standard, problem unification, retrieval. A hint: all systems, all systems, all systems, non-informative systems. Constraints example: f(0,t) = 0, f(n,t) = 0. Monitoring online reputation (RepLab 2012)... and we want a single quality measure!

Three mutually exclusive properties. Strictness is the maximum difference between a high score and a low score according to any other measure. Robustness is the average correlation between system scores across topics. R&S for document retrieval.

Conclusions. Did we reach the moon? Almost! :-)

- We merged filtering, retrieval and clustering into a single "document organization" problem.
- We defined formal restrictions for document organization.
- We defined Reliability/Sensitivity, Bcubed-inspired measures that (i) satisfy the desired formal properties and (ii) behave well in practice (robustness, strictness).
- This is the first suitable single quality measure for combined document organization problems (e.g. clustering + ranking).

But:

- Computational cost (combinatorial explosion of document pairs)
- How to weight different types of relationships?
- Interpretability of the scale (inherited from BCubed)
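The rank-based document weighting described in the talk (the weight of the document at rank i is the integral of 1/x^2 over that rank's interval, weights sum to 1 over the full ranking, and a parameter controls how much weight the top n documents carry) can be sketched as follows. The offset parameter `c` and both helper functions are assumptions made for this illustration, not the paper's exact parameterization:

```python
def rank_weight(i, c):
    """Weight of the document at rank i (1-based): the integral of 1/x^2
    over [i-1+c, i+c], scaled by c so that the weights of an infinite
    ranking sum to 1. (The offset c is an assumed parameterization that
    keeps the integral convergent at rank 1.)"""
    return c * (1.0 / (i - 1 + c) - 1.0 / (i + c))

def offset_for_top_mass(n, mass):
    """Choose c so the first n documents carry `mass` of the total weight.
    The cumulative weight of the top n telescopes to n / (n + c),
    hence c = n * (1 - mass) / mass."""
    return n * (1.0 - mass) / mass

# "The top n documents carry e.g. 80% of total weight": n = 10, mass = 0.8.
c = offset_for_top_mass(10, 0.8)
weights = [rank_weight(i, c) for i in range(1, 11)]
print(round(sum(weights), 6))  # → 0.8
```

Because the weights telescope, the top-n mass has a closed form, and the weights decrease with rank, so prioritizing a relevant document over an irrelevant one always increases the weighted score.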
