One Metric to Rule Them All: A General Evaluation Measure for Information Access Tasks

Presentation by Enrique Amigó and Julio Gonzalo at Google Zurich, 18 December 2012

Published by Julio Gonzalo on 21 December 2012

Transcript of One metric to rule them all: A general Evaluation Measure for Information Access Tasks

Motivation. Now think big... Imagine we find a single optimal measure for document retrieval. Now think bigger: imagine a single optimal measure for document retrieval, document clustering, document filtering, document organization... and their combinations! A measure that is simple, intuitive, sound... and strict with respect to all other metrics. We need to evaluate mixed problems: "I'm ill, could you do my work?" "Of course." "I have organized all the information for you." "OK, thanks." "I have organized all the information for you." "DON'T DO THAT!!"

The general task: document organization. Notation: the input is a topic and a set of documents; the output orders the documents by priority (1, 2, 3, ... less important), possibly discards some, and groups them into clusters. The general task includes document ranking, document filtering and clustering (including overlapping clustering). For any pair of documents we only need to know: which has more priority in the system output, which has more priority in the gold standard, whether they share a cluster in the system output, and whether they share a cluster in the gold standard.

Intuitions: clustering related documents is good; clustering unrelated documents is bad; noise is better placed with noise; breaking one cluster is better than breaking n clusters.

Formal constraints:
- Priority constraint: prioritizing a relevant document over an irrelevant one increases the score.
- Deepness constraint.
- Deepness threshold constraint: there is a ranking area which is always explored.
- Closeness threshold constraint: there is a ranking area which is never explored.

Reliability and Sensitivity. How many relevant documents are at the top of the ranking? For overlapping clustering, the measures are similar to the extended BCubed measures (Amigó et al. 2009): sharing too many clusters decreases R, and sharing too few decreases S. Quality-decreasing operations include breaking a cluster (worse when the cluster has higher priority), removing a document instance, and adding a spurious document relationship. Reliability and Sensitivity map into the BCubed measures, and a high F(R,S) ensures a high score according to any other measure. In filtering, Reliability and Sensitivity map into:

  R -> Precision(Neg class) * Precision(Pos class)
  S -> Recall(Neg class) * Recall(Pos class)

Document weighting by rank position: the weight for document i is the integral of 1/x^2 from i-1 to i, normalized so that the weights sum to 1 (c1 = c). Parameterization: the first n documents carry W_n of the total weight (two parameters), so we can integrate R and S over the long tail of documents. The sum of all document weights should be 1 (the integral of a convergent function), and we want to say that the top n documents carry e.g. 80% of the total weight. Depending on this parameter, one relevant document can be better than n relevant documents placed after n irrelevant ones. Giving more weight to the first rank positions is less robust (less data) but more strict (it covers P@10).
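The rank weighting described in this talk can be sketched in code. This is a minimal reading, assuming the integration interval is shifted to [i, i+1] so the integral converges at the top rank; with that reading the weights telescope to w(i) = 1/i - 1/(i+1) and sum to 1 over all ranks. The talk's exact normalization constants (c, c1, W_n) are not recoverable from the transcript.

```python
def rank_weight(i):
    """Weight of rank i (1-based): the integral of 1/x^2 over [i, i+1].

    Assumption: the interval is shifted by one position relative to the
    slide's [i-1, i], so the integral converges at rank 1. The weights
    telescope, w(i) = 1/i - 1/(i+1), and sum to 1 over all ranks.
    """
    return 1.0 / i - 1.0 / (i + 1)

def top_n_weight(n):
    """Total weight carried by the first n ranks: 1 - 1/(n+1)."""
    return sum(rank_weight(i) for i in range(1, n + 1))

# The top of the ranking dominates: rank 1 alone carries half the weight,
# and the top 9 ranks together carry about 90% of it.
print(rank_weight(1))   # 0.5
print(top_n_weight(9))  # ~0.9
```

Note how the telescoping form makes the "top n documents carry X% of the weight" parameterization easy to invert: under this assumed weighting, the top n ranks always carry n/(n+1) of the total.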
THERE IS NO UNIQUE, OBJECTIVE GROUND TRUTH FOR ORGANIZING DOCUMENTS! We cannot ensure that one human-produced organization is better than another... but by resembling one expert, we increase the probability of helping other experts. So: similarity to a gold standard = quality?

Existing clustering measures checked against the formal constraints:

  Measure                                  Completeness  Homogeneity  Rag bag  Cluster size vs. quantity
  Purity / inverse purity / clustering F   OK            FAIL         OK       FAIL
  Rand, Jaccard, F&M                       OK            OK           FAIL     FAIL
  Entropy, Mutual information, VI          OK            OK           OK       FAIL
  Edit distance                            FAIL          OK           FAIL     OK
  BCubed                                   OK            OK           OK       OK (?)

Documents at the top of the ranking have more weight. There are more than 100 different measures, and the problem is more than ordering documents: an ocean of documents, an ocean of discarded documents, and limited oxygen (top documents, full ranking, publication date).

Three questions: How clean are the clusters? How many documents are correctly classified? To what extent is each document correctly related to the rest?

Reliability: the precision of the relationships in the system output. Sensitivity: the recall of the relationships stated in the gold standard. Document weighting: we compute the weighted average of precision over all documents, expressing the conditional probability as a sum of products of probabilities, in terms of document weights. This is what we need to compute.
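The definitions of Reliability (precision of relationships in the system output) and Sensitivity (recall of gold-standard relationships) can be sketched as follows. This is a simplified, unweighted illustration for the non-overlapping case; the actual measures add rank-based document weighting and handle overlapping clusters, and the function names here are mine, not the talk's.

```python
def share_cluster(assign, d, e):
    """True if documents d and e share at least one cluster."""
    return bool(assign[d] & assign[e])

def relationship_precision(output, reference):
    """BCubed-style: for each document, the fraction of its relationships
    in `output` that also hold in `reference`, averaged over documents
    that have at least one relationship in `output`."""
    per_doc = []
    for d in output:
        related = [e for e in output if e != d and share_cluster(output, d, e)]
        if related:
            ok = sum(share_cluster(reference, d, e) for e in related)
            per_doc.append(ok / len(related))
    return sum(per_doc) / len(per_doc) if per_doc else 0.0

def reliability(system, gold):
    """Precision of the relationships in the system output."""
    return relationship_precision(system, gold)

def sensitivity(system, gold):
    """Recall of the relationships stated in the gold standard."""
    return relationship_precision(gold, system)

# The system splits d3 off; the gold standard keeps all three together.
system = {"d1": {1}, "d2": {1}, "d3": {2}}
gold = {"d1": {1}, "d2": {1}, "d3": {1}}
print(reliability(system, gold))  # 1.0: every system relationship is correct
print(sensitivity(system, gold))  # ~0.33: most gold relationships are missed
```

The symmetry is the point: Sensitivity is just Reliability with the roles of system output and gold standard swapped, exactly as recall relates to precision.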
Is a relationship in X reflected in G? If d and d' are related SEVERAL times in G and X, we do not know the relationship matching: in overlapping clustering, d and d' can be related several times in G or in X. (Diagram: d1 and d2 share two clusters in the system output X but only one in the gold standard G, hence P(r(d1,d2) in G) = 1 and P(r(d1,d2) in X) = 1/2.) Assuming the best matching fails in the case of overlapping clustering: system A breaks one cluster while system B breaks three clusters. The trick: R and S compute each document instance separately.

Existing ranking and filtering measures checked against the constraints (top documents, full ranking, publication date): FAIL in every case. Confidence constraint: systems should know when they do not know. FAIL. P@n, RR; Kendall, AUC; MRR, Bpref; MAP, DCG, Q-measure; Lam, Mutual Information, Chi coefficient; Utility, weighted accuracy: none of them satisfies all the constraints that Reliability and Sensitivity do.

(Plot: score as a function of the amount of randomly returned documents, for class-oriented measures, utility measures and informativeness measures.) Reliability*Sensitivity is the most robust measure and the strictest measure.

Document Organization. Evaluation campaigns, beware of your power!
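The filtering mapping quoted earlier in the talk (R maps to Precision(Neg)·Precision(Pos), S to Recall(Neg)·Recall(Pos)) is easy to sketch. The function names and the 0/1 label encoding are mine, not the talk's:

```python
def per_class_pr(system, gold, cls):
    """Precision and recall of one class for binary filtering decisions."""
    tp = sum(1 for s, g in zip(system, gold) if s == cls and g == cls)
    in_system = sum(1 for s in system if s == cls)
    in_gold = sum(1 for g in gold if g == cls)
    precision = tp / in_system if in_system else 0.0
    recall = tp / in_gold if in_gold else 0.0
    return precision, recall

def filtering_r_s(system, gold):
    """R = P(neg) * P(pos), S = R(neg) * R(pos), per the talk's mapping."""
    p_pos, r_pos = per_class_pr(system, gold, 1)
    p_neg, r_neg = per_class_pr(system, gold, 0)
    return p_neg * p_pos, r_neg * r_pos

# Labels: 1 = keep (positive class), 0 = discard (negative class).
system = [1, 1, 0, 0]
gold = [1, 0, 0, 0]
R, S = filtering_r_s(system, gold)
print(R)  # 0.5: P(pos) = 1/2, P(neg) = 1
print(S)  # ~0.667: R(pos) = 1, R(neg) = 2/3
```

Taking the product over both classes is what penalizes the trivial "keep everything" and "discard everything" baselines, which score well on one class but zero on the other.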
Pick up a measure and then... Measure becomes standard
because people use it, and people use it because it is standard: a vicious circle! The Hilton-Chihuahua metric effect! The MT evaluation mess. Why? In the hard sciences, temperature is what a thermometer measures. In Information Access, is Document Retrieval what MAP measures? MAP, MRR, P@10, nDCG...

The measure defines the problem:

- An adequate measure is a fundamental step towards solving a problem.
- An inadequate measure implies focusing on the wrong problem.

Why evaluation metrics? State of the art:

- Popular strategy number 1: use the most popular metric.
- Popular strategy number 2: get creative.
- Popular strategy number 3: use the simplest.
- Popular strategy number 4: use the one that says what I want to hear.

Idea: formal constraints. We may not be able to prescribe how to design the evaluation function... but we know what the function has to do in certain simple boundary conditions, and any suitable metric has to satisfy them. Constraints example: f(0,t) = 0 and f(n,t) = 0 (non-informative systems). (Plot: all systems vs. non-informative systems.)

Case study: Web People Search, a metric problem (the task, the evaluation campaign): BCubed precision and recall, the multiplicity constraint, extended BCubed. Do other humans agree? 92%... 100%. Yes! Analysis... and the winner is... :-) Problem solved!

:-( How? OK... But what? ... Well, the moon! :-) Yes... a single metric to rule them all!

One Metric to Rule Them All: A General Evaluation Measure for Information Access Tasks. Enrique Amigó and Julio Gonzalo, nlp.uned.es.

Outline: formal constraints; new measures; clustering; filtering; gold standard; problem unification; retrieval; a hint. Monitoring online reputation (RepLab 2012)... and we want a single quality measure!

Three mutually exclusive properties. Strictness is the maximum difference between a high score and a low score according to any other measure. Robustness is the average correlation between system scores across topics. R&S for Document Retrieval.

Conclusions. Did we reach the moon? Almost! :-)

- We merged Filtering, Retrieval and Clustering into a single "document organization" problem.
- We defined formal restrictions for document organization.
- We defined Reliability/Sensitivity, BCubed-inspired measures that (i) satisfy the desired formal properties and (ii) behave well in practice (robustness, strictness).
- This gives the first suitable single quality measure for combined document organization problems (e.g. clustering + ranking).

But:

- Computational cost (combinatorial explosion of document pairs).
- How should different types of relationships be weighted?
- Interpretability of the scale (inherited from BCubed).
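As a footnote on the robustness criterion used in the talk (the average correlation between system scores across topics), here is one way it can be operationalized: the average pairwise Pearson correlation of per-topic system-score vectors. The data layout and function names are my own illustration, not the paper's implementation.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def robustness(scores_by_topic):
    """Average Pearson correlation between the system-score vectors of
    every pair of topics. High robustness means the measure ranks the
    systems consistently, whatever the topic."""
    pairs = list(combinations(scores_by_topic, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Three topics, four systems: all topics agree on the system ordering,
# and the scores are linearly related, so robustness is maximal.
scores = [[0.9, 0.7, 0.5, 0.3],
          [0.8, 0.6, 0.4, 0.2],
          [0.45, 0.35, 0.25, 0.15]]
print(robustness(scores))  # ~1.0 for perfectly consistent scores
```

A measure scoring near 1.0 here is robust in the talk's sense; a measure whose per-topic verdicts contradict each other would drive the average correlation toward zero.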