One metric to rule them all: A general Evaluation Measure for Document Organization Tasks

Presentation by Julio Gonzalo at University of Padova, May 2014

Transcript of One metric to rule them all: A general Evaluation Measure for Document Organization Tasks

Diagram: input documents are organized into output topics (topic 1, 2, 3, ...), ordered by decreasing importance, with the remainder discarded.
Document Clustering, Document Retrieval and Document Filtering are all instances of Document Organization.
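As a hedged illustration of this unification (the class and field names below are my own, not the paper's formalism), the three tasks can all be described with one output structure: groups of documents, an order of importance over them, and a discarded set.

    from dataclasses import dataclass, field

    @dataclass
    class Organization:
        """One output format covering clustering, retrieval and filtering."""
        clusters: list            # sets of doc ids, listed in decreasing importance
        discarded: set = field(default_factory=set)

    # Retrieval: singleton clusters in rank order.
    retrieval = Organization(clusters=[{"d1"}, {"d2"}, {"d3"}])
    # Filtering: one retained group, the rest discarded.
    filtering = Organization(clusters=[{"d1", "d2"}], discarded={"d3"})
    # Clustering: groups with no meaningful order.
    clustering = Organization(clusters=[{"d1", "d2"}, {"d3"}])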
Clustering related docs. is good
Clustering unrelated docs. is bad
Noise is better with noise
Breaking one cluster is better than breaking n clusters
Priority constraint
Deepness constraint
Deepness threshold constraint
Closeness threshold constraint
There is a ranking area which is always explored
There is a ranking area which is never explored
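Constraints like these can be checked mechanically. A minimal sketch for the Priority constraint, assuming a measure is any function measure(ranking, relevance) -> float with higher-is-better (the helper names are illustrative, not from the paper): moving a more relevant document above an adjacent, less relevant one must increase the score.

    def satisfies_priority(measure, ranking, relevance):
        """Check the Priority constraint on one ranking: swapping a more
        relevant document above an adjacent, less relevant one must raise
        the score."""
        base = measure(ranking, relevance)
        for i in range(len(ranking) - 1):
            top, below = ranking[i], ranking[i + 1]
            if relevance[below] > relevance[top]:
                swapped = list(ranking)
                swapped[i], swapped[i + 1] = below, top
                if measure(swapped, relevance) <= base:
                    return False
        return True

    # P@2 is blind to swaps inside the top 2 positions, so it fails here:
    p_at_2 = lambda ranking, rel: sum(rel[d] for d in ranking[:2]) / 2
    print(satisfies_priority(p_at_2, ["d1", "d2", "d3"], {"d1": 0, "d2": 1, "d3": 1}))
    # -> False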
Reliability and Sensitivity
High-Level Definition
Overlapping Clustering
Similar to the extended BCubed (Amigó et al. 2009)
Sharing too many clusters decreases R
Sharing too few clusters decreases S
A high F(R,S) ensures a high score according to any other measure
R -> Precision(Neg class)*Precision(Pos class)
S -> Recall(Neg class)*Recall(Pos class)
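A minimal numeric sketch of this product formulation for filtering (the counts are invented for illustration):

    def filtering_r_s(tp, fp, fn, tn):
        """R = Precision(neg) * Precision(pos); S = Recall(neg) * Recall(pos),
        as stated above (zero when a class is never predicted / never present)."""
        p_pos = tp / (tp + fp) if (tp + fp) else 0.0
        p_neg = tn / (tn + fn) if (tn + fn) else 0.0
        r_pos = tp / (tp + fn) if (tp + fn) else 0.0
        r_neg = tn / (tn + fp) if (tn + fp) else 0.0
        return p_neg * p_pos, r_neg * r_pos

    print(filtering_r_s(tp=40, fp=10, fn=20, tn=30))
    # R = (30/50) * (40/50) = 0.48,  S = (30/40) * (40/60) = 0.5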
Prioritizing a relevant document wrt an irrelevant document must increase the score
The sum of all document weights should be 1 (the integral of a convergent function).
Precision/recall bias: we want to say that the top n documents carry W% of the total weight (weight as a function of document rank position).
Idea: Formal Constraints
Need to evaluate mixed problems!
Purity / inverse purity, clustering F, Rand, Jaccard, F&M, Entropy, Mutual Information, VI, Edit distance, BCubed
Constraint analysis (COMPLETENESS, HOMOGENEITY, RAG BAG, CLUSTER SIZE VS QUANTITY): each metric family above passes (OK) or fails (FAIL) each constraint; only BCubed satisfies all four (Amigó et al. 2009).
Documents at the top of the ranking have more weight
More than 100 different measures
More than ordering documents
An ocean of documents
An ocean of discarded documents
A limited supply of oxygen
Top documents
Full ranking
Publication date
To what extent is a document correctly related to the rest?
Reliability
Precision of relationships in the system output
System output
Gold
Relationships
Sensitivity
Recall of relationships stated in the gold standard
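A minimal sketch of this definition for the simplest case (same-cluster relationships only, no weights; the full measure also covers priority relationships and rank-based weights, discussed next; function names are mine):

    def pair_relationships(clusters):
        """All unordered pairs of documents placed in the same group."""
        rels = set()
        for group in clusters:
            docs = sorted(group)
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    rels.add((docs[i], docs[j]))
        return rels

    def reliability_sensitivity(system_clusters, gold_clusters):
        sys_rels = pair_relationships(system_clusters)
        gold_rels = pair_relationships(gold_clusters)
        shared = sys_rels & gold_rels
        r = len(shared) / len(sys_rels) if sys_rels else 0.0   # precision of output relationships
        s = len(shared) / len(gold_rels) if gold_rels else 0.0 # recall of gold relationships
        return r, s

    print(reliability_sensitivity([{"d1", "d2", "d3"}], [{"d1", "d2"}, {"d3"}]))
    # -> (0.333..., 1.0): 3 relationships asserted, only 1 present in the gold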
The detail: weights
Not all relationships & documents have the same importance: need weights!
We need to compute this
Is a relationship in X reflected in G?
If d and d' are related SEVERAL times in G and X, then we do not know which relationship matches which.
In overlapping clustering, d and d' can be related several times in G or X.
If we assume the best matching
Worked example (overlapping clustering): documents d1, d2, d3 grouped into several clusters in the SYSTEM OUTPUT and in the GOLD standard.
P(r(d1,d2) in G) = 1
P(r(d1,d2) in X) = 1/2
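Assuming the best matching, the natural reading (in the spirit of the extended BCubed of Amigó et al. 2009; function names and the toy clusterings are mine) is that a relationship asserted k times on one side can be matched at most min(k, k') times on the other:

    def times_together(clusters, d, dp):
        """How many clusters contain both d and dp."""
        return sum(1 for c in clusters if d in c and dp in c)

    def p_relationship_matched(source, target, d, dp):
        """Fraction of the d-dp relationships in `source` that can be matched
        in `target`, assuming the best matching."""
        k = times_together(source, d, dp)
        return min(k, times_together(target, d, dp)) / k if k else None

    # Toy configuration reproducing the numbers above: d1 and d2 share one
    # cluster in the system output X and two clusters in the gold standard G.
    X = [{"d1", "d2"}, {"d1", "d3"}]
    G = [{"d1", "d2"}, {"d1", "d2", "d3"}]
    print(p_relationship_matched(X, G, "d1", "d2"))   # P(r(d1,d2) in G) = 1.0
    print(p_relationship_matched(G, X, "d1", "d2"))   # P(r(d1,d2) in X) = 0.5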
Top documents / Full ranking / Publication date: FAIL
Confidence constraint
No information is better than wrong information
FAIL
P@n, RR
Kendall, AUC
MRR, Bpref
MAP, DCG, Q-measure
RBP
???????
Constraint analysis (DEEPNESS, PRIORITY, DEEPNESS THRESHOLD, CLOSENESS THRESHOLD): OK/FAIL grid for the ranking measures above.
CONFIDENCE CONSTRAINT: OK/FAIL per measure.
Three measure families
F measure
Lam, Mutual Information, Chi coefficient...
Utility, weighted accuracy
Plot: score as a function of the amount of randomly returned documents.
Class oriented measures
Utility measures
Informativeness measures
F(R,S) is
- the most robust measure
- the strictest measure
"Document Organization"
We may not be able to prescribe
how to design the evaluation function...

But we know what the function has to do
in certain simple boundary conditions

Any suitable metric has to satisfy them
One Metric to Rule Them All: A General Evaluation Measure for Information Access Tasks
Enrique Amigó, Julio Gonzalo and Felisa Verdejo @ SIGIR 2013
nlp.uned.es

Formal Constraints
New Measures
Clustering
Filtering
Problem Unification
Retrieval
A hint
All systems
Non-informative
systems
Constraints
f(0,t)= 0
f(n,t)= 0
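One reading of these constraints: a filtering system that returns nothing (t = 0) or everything (t = n) carries no information and should score zero. A minimal sketch of why the product formulation from the filtering slide satisfies this, taking precision over an empty predicted class as 0 (an assumed convention, not spelled out in the slides; names are illustrative):

    def product_precision(pred_pos, pred_neg, gold_pos):
        """R = Precision(neg) * Precision(pos) over explicit predicted sets."""
        def precision(predicted, want_positive):
            if not predicted:
                return 0.0                  # empty predicted class -> 0 (assumed convention)
            hits = sum(1 for d in predicted if (d in gold_pos) == want_positive)
            return hits / len(predicted)
        return precision(pred_pos, True) * precision(pred_neg, False)

    docs, gold_pos = {"d1", "d2", "d3", "d4"}, {"d1", "d2"}
    print(product_precision(docs, set(), gold_pos))                 # accept all  -> 0.0
    print(product_precision(set(), docs, gold_pos))                 # discard all -> 0.0
    print(product_precision({"d1", "d2"}, {"d3", "d4"}, gold_pos))  # perfect     -> 1.0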
Ex: Online Reputation Monitoring
Three mutually exclusive properties
Strictness is the maximum difference between a high score under the measure and a low score according to any other measure: a strict measure is one for which a high score guarantees a reasonably high score under every other measure.
Robustness is the average correlation
between system scores across topics.
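A minimal sketch of how robustness can be computed under this definition (function and variable names, and the numbers, are illustrative):

    from itertools import combinations
    from statistics import correlation   # Pearson correlation, Python 3.10+

    def robustness(scores_per_topic):
        """Average pairwise correlation between per-topic vectors of system
        scores (same system order in every topic)."""
        pairs = list(combinations(scores_per_topic, 2))
        return sum(correlation(a, b) for a, b in pairs) / len(pairs)

    # Three systems scored on three topics (illustrative numbers):
    print(robustness([[0.2, 0.5, 0.9],
                      [0.3, 0.4, 0.8],
                      [0.1, 0.6, 0.7]]))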
Motivation
Hard Sciences
Temperature is what a
thermometer measures
Information Access
Document Retrieval is what MAP measures?
MAP
MRR
P@10
n-DCG
The measure defines the problem:

- An adequate measure is a fundamental step towards solving a problem
- An inadequate measure implies focusing on the wrong problem
Why?
Filtering
Clustering
Ranking
Success story:
Amigó et al. (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval 12 (4).
Conclusions
Computational cost for the full task
How to weight different types of relationships?
Too many parameters? (n, W, alpha)
What?
System output
Evaluation measure
Gold standard
Definition
Results
Clustering
Retrieval
Filtering
Giving more weight to the first rank positions is less robust (less data) but more strict (it covers P@10)
Constraints
Analysis of existing metrics
Analysis
40+ metrics, 4 constraints -> One metric!
Yes... A single metric to rule
them all :-)
Find a single optimal measure for all these problems...
... and their combinations
Analyze and compare existing IR metrics with respect to formal constraints
Find a single optimal evaluation measure for Document Retrieval
Plot: document weight as a function of document rank position (example parameters: n = 2, W = 80).
1. Adding Weights
2. Choosing weights
Weight parameters
final formula:
Reliability = BCubed Precision
Sensitivity = BCubed Recall
Retrieval + Clustering + Filtering -> "document organization"
Formal constraints
Reliability/Sensitivity:
First single quality measure for mixed problems
satisfies the desired formal properties
flexible (e.g. P@10 -> R,S with n=10, W=99)
behaves well in every task (robustness, strictness)
(Amigó et al. 2009)
Reliability: for the relationships in the system output, the (weighted) probability of finding them in the gold standard. Sensitivity: for the relationships in the gold standard, the (weighted) probability of finding them in the system output. Both are combined through F(R,S).
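Read from these annotations, a plausible rendering of the two quantities (not necessarily the paper's exact notation; w(d,d') is the relationship weight, normalized to sum to 1 over the relationships considered) is:

    R(X, G) = \sum_{(d,d') \in \mathrm{rel}(X)} w(d,d') \, P\bigl(r_G(d,d') \mid r_X(d,d')\bigr)

    S(X, G) = \sum_{(d,d') \in \mathrm{rel}(G)} w(d,d') \, P\bigl(r_X(d,d') \mid r_G(d,d')\bigr)

The score reported in the talk is F(R,S), the F measure over R and S (with the alpha parameter balancing the two, cf. the parameter list n, W, alpha).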
Constraint Analysis
It depends on the weight of d, d'
The sum of all document weights is 1.
We can integrate over the long tail.
We can state that the first n documents carry W% of the weight in the evaluation.
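The particular weighting function is not reproduced in this transcript; as an illustration only (the geometric decay is my assumption, not necessarily the paper's choice), here is one function meeting both requirements: the weights of an unbounded ranking sum to 1, and the top n positions carry W% of the total weight.

    def rank_weight(i, n, W):
        """Weight of rank position i (1-based) under a geometric decay chosen
        so that the full (infinite) ranking sums to 1 and positions 1..n
        carry W% of the total weight."""
        p = (1.0 - W / 100.0) ** (1.0 / n)    # solve 1 - p**n = W / 100
        return (1.0 - p) * p ** (i - 1)

    # Example parameters from the slides: n = 2, W = 80
    w = [rank_weight(i, n=2, W=80) for i in range(1, 11)]
    print(round(sum(w[:2]), 3))   # -> 0.8: the top 2 documents carry 80% of the weight
    # Setting n = 10 and W = 99 makes R/S behave much like P@10, as noted earlier.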
Reliability=Precision of relationships
Sensitivity=Recall of relationships
Code available at nlp.uned.es
;-)
Reliability & Sensitivity likely to be
compulsory for SIGIR 2014
BCubed Precision/Recall
Evaluation Campaigns,
beware of your power!
Pick up a measure and then...
Measure becomes standard
because people use it
People use it because it is standard
Popular strategy number 2:
get creative
The Hilton-Chihuahua
metric effect!
vicious circle!
MT evaluation mess
Popular strategy number 1:
use the most popular metric
State of the Art
Popular strategy number 3:
use the simplest
Popular strategy number 4:
use the one that says
what I want to hear
computed per item
BCubed-inspired plus:
+ weight that decreases with ranking depth
+ parameters to establish that the first n documents carry W% of the weight
Padova, May 8 2014