[Figure: a system's ranked output for topics 1, 2, 3, ...; documents further down the ranking carry less importance, and the tail is discarded]

Document Clustering

Document Retrieval

Document Filtering

Document Organization

Clustering related docs. is good

Clustering unrelated docs. is bad

Noise is better grouped with noise

Breaking one cluster is better than breaking n clusters

Priority constraint

Deepness constraint

Deepness threshold constraint

Closeness threshold constraint

There is a ranking area which is always explored

There is a ranking area which is never explored

Reliability and Sensitivity

High-Level Definition

Overlapping Clustering

Similar to the extended BCubed (Amigó et al. 2009)

Sharing too many clusters decreases R

Sharing too few clusters decreases S

High F(R,S) ensures high score according to any measure

R -> Precision(Neg class)*Precision(Pos class)

S -> Recall(Neg class)*Recall(Pos class)
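This mapping can be sketched in a few lines; a minimal illustration of R and S for a binary filtering task, assuming 1 = positive class and 0 = negative class (the function names are illustrative, not from the paper):

```python
def class_precision(pred, gold, label):
    """Fraction of items predicted as `label` that truly have `label`."""
    predicted = [g for p, g in zip(pred, gold) if p == label]
    return sum(g == label for g in predicted) / len(predicted) if predicted else 0.0

def class_recall(pred, gold, label):
    """Fraction of items truly labeled `label` that are predicted as `label`."""
    relevant = [p for p, g in zip(pred, gold) if g == label]
    return sum(p == label for p in relevant) / len(relevant) if relevant else 0.0

def reliability(pred, gold):
    # R -> Precision(neg class) * Precision(pos class)
    return class_precision(pred, gold, 0) * class_precision(pred, gold, 1)

def sensitivity(pred, gold):
    # S -> Recall(neg class) * Recall(pos class)
    return class_recall(pred, gold, 0) * class_recall(pred, gold, 1)
```

Note that a system that accepts everything (or rejects everything) predicts no items for one of the two classes, so R = 0, in line with the f(0,t) = f(n,t) = 0 behaviour required of non-informative systems.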

Prioritizing a relevant document wrt an irrelevant document must increase the score

The sum of all document weights should be 1

-> integral of a convergent function

Precision/recall bias: we want to say that the top n documents carry W% of the total weight (as a function of document rank position)

Idea: Formal Constraints

Need to evaluate mixed problems!

Purity / inverse purity

Clustering F

Rand, Jaccard, F&M

Entropy, Mutual Information, VI

Edit distance

BCubed

COMPLETENESS

HOMOGENEITY

RAG BAG

CLUSTER SIZE VS QUANTITY

[Table: OK/FAIL results for each clustering metric family against the four constraints (COMPLETENESS, HOMOGENEITY, RAG BAG, CLUSTER SIZE VS QUANTITY); only BCubed satisfies all four]

Documents at the top of the ranking have more weight

More than 100 different measures

More than ordering documents

An ocean of documents

An ocean of discarded documents

Limited oxygen

Top documents

Full ranking

Publication date

To what extent is a document correctly related to the rest?

Reliability

Precision of relationships in the system output

System output

Gold

Relationships

Sensitivity

Recall of relationships stated in the gold standard
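Ignoring weights for a moment, the precision/recall-of-relationships reading can be sketched over pairwise relations (an unweighted simplification; names are illustrative, and the actual measures weight each relationship):

```python
from itertools import combinations

def related_pairs(clusters):
    """All unordered document pairs that share at least one cluster."""
    pairs = set()
    for cluster in clusters:
        for d, e in combinations(sorted(cluster), 2):
            pairs.add((d, e))
    return pairs

def reliability_unweighted(system, gold):
    """Precision of relationships: system relations also found in the gold."""
    sys_rel, gold_rel = related_pairs(system), related_pairs(gold)
    return len(sys_rel & gold_rel) / len(sys_rel) if sys_rel else 0.0

def sensitivity_unweighted(system, gold):
    """Recall of relationships: gold relations also found in the system output."""
    sys_rel, gold_rel = related_pairs(system), related_pairs(gold)
    return len(sys_rel & gold_rel) / len(gold_rel) if gold_rel else 0.0
```

For example, a system that puts d1, d2, d3 in one cluster when the gold standard only relates d1 and d2 gets R = 1/3 but S = 1.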

The detail: weights

Not all relationships & documents have the same importance: need weights!

We need to compute this

Is a relationship in X reflected in G?

If d and d' are related SEVERAL times in G and X,

then we do not know the relationship matching

In overlapping clustering d and d' can be related

several times in G or X

If we assume the best matching

[Figure: example system output vs. gold standard with overlapping clusters over documents d1, d2, d3]

P(r(d1,d2) in G)=1

P(r(d1,d2) in X)=1/2
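One reading of these probabilities, following the best-matching idea of the extended BCubed (this is an interpretation of the slide, not a quote of the paper's formula), estimates them from the number of clusters a pair shares in each clustering:

```python
def shared_clusters(clustering, d, e):
    """Number of clusters in which both documents appear."""
    return sum(1 for cluster in clustering if d in cluster and e in cluster)

def p_relation_matched(source, target, d, e):
    """Probability that a relation between d and e stated in `source`
    is matched in `target`, assuming the best matching of multiple relations."""
    in_source = shared_clusters(source, d, e)
    if in_source == 0:
        return 0.0
    return min(in_source, shared_clusters(target, d, e)) / in_source
```

With d1 and d2 sharing two clusters in the system output X but only one in the gold G, a relation in G is always matched in X (probability 1), while a relation in X is matched in G with probability 1/2, reproducing the values above.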

Top documents

Full ranking

Publication date

FAIL


Confidence constraint

No information is better than wrong information

FAIL

P@n, RR

Kendall, AUC

MRR, Bpref

MAP, DCG, Q-measure

RBP

???????

DEEPNESS

PRIORITY

DEEPNESS THRESHOLD

CLOSENESS THRESHOLD

[Table: OK/FAIL results for each ranking metric group (P@n/RR; Kendall/AUC; MRR/Bpref; MAP/DCG/Q-measure; RBP) against the DEEPNESS, PRIORITY, DEEPNESS THRESHOLD and CLOSENESS THRESHOLD constraints]

CONFIDENCE CONSTRAINT

[Table: OK/FAIL results for each metric against the confidence constraint]

Three measure families

F measure

Lam, Mutual Information, Chi coefficient...

Utility, weighted accuracy

[Plot: score as a function of the amount of randomly returned documents]

Class oriented measures

Utility measures

Informativeness measures

F(R,S) is

- the most robust measure

- the strictest measure

**"Document Organization"**

We may not be able to prescribe

how to design the evaluation function...

But we know what the function has to do

in certain simple boundary conditions

Any suitable metric has to satisfy them

**One Metric to Rule Them All: A General Evaluation Measure for Information Access Tasks**

**Enrique Amigó, Julio Gonzalo and Felisa Verdejo @ SIGIR 2013**

nlp.uned.es


**Formal Constraints**

**New Measures**

**Clustering**

**Filtering**

**Problem Unification**

**Retrieval**

A hint

All systems

Non-informative systems

Constraints

f(0,t)= 0

f(n,t)= 0

Ex: Online Reputation Monitoring

Three mutually exclusive properties

Strictness is the maximum difference between a high score and a low score according to any other measure.

Robustness is the average correlation between system scores across topics.

**Motivation**

Hard Sciences

Temperature is what a

thermometer measures

Information Access

Document Retrieval is what MAP measures?

MAP

MRR

P@10

nDCG

The measure defines the problem:

- An adequate measure is a fundamental step towards solving a problem

- An inadequate measure implies focusing on the wrong problem

Why?

Filtering

Clustering

Ranking

Success story:

Amigó et al. (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval 12 (4).

Conclusions

Computational cost for the full task

How to weight different types of relationships?

Too many parameters? (n, W, alpha)

What?

System output

Evaluation measure

Gold standard

**Definition**

**Results**

**Clustering**

**Retrieval**

**Filtering**

Giving more weight to the first rank positions is less robust (less data) but more strict (it covers P@10)

Constraints

Analysis of existing metrics

Analysis

40+ metrics, 4 constraints -> One metric!

Yes... A single metric to rule

them all :-)

Find a single optimal measure for all these problems...

... and their combinations

Analyze and compare existing IR metrics with respect to formal constraints

Find a single optimal evaluation measure for Document Retrieval

document weight

document rank position


n=2

W=80

1. Adding Weights

2. Choosing weights

Weight parameters

final formula:

Reliability = BCubed Precision

Sensitivity = BCubed Recall

Retrieval + Clustering + Filtering -> "document organization"

Formal constraints

Reliability/Sensitivity:

First single quality measure for mixed problems

satisfies the desired formal properties

flexible (e.g. P@10 -> R,S with n=10, W=99)

behaves well in every task (robustness, strictness)

(Amigó et al. 2009)

relation in system output

relation in gold standard

for relationships in the output

probability of finding them

in the gold standard

weight of the relationship

F(R,S)


Constraint Analysis

It depends on the weight of d, d'

The sum of all document weights is 1.

We can integrate over the long tail.

We can state that the first n documents carry the W% of weight in the evaluation.
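One convergent family with a closed form for both requirements (an illustration; the paper's actual weighting function may differ) is a geometric decay: weights (1-p)p^(i-1) sum to 1 over an infinite ranking, and choosing p = (1 - W/100)^(1/n) makes the first n positions carry exactly W% of the weight:

```python
def decay_rate(n, W):
    """Choose p so that the first n rank positions carry W% of the weight."""
    return (1.0 - W / 100.0) ** (1.0 / n)

def weight(i, p):
    """Weight of rank position i (1-based); weights sum to 1 over i = 1..inf."""
    return (1.0 - p) * p ** (i - 1)

# The slides' example: n=2, W=80 -> the top 2 documents carry 80% of the weight
p = decay_rate(2, 80)
top2 = weight(1, p) + weight(2, p)
```

Since the geometric series converges to 1, the long tail of discarded documents is covered without ever assigning zero weight to a position.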

Reliability=Precision of relationships

Sensitivity=Recall of relationships

Code available at nlp.uned.es

;-)

Reliability & Sensitivity likely to be

compulsory for SIGIR 2014

BCubed Precision/Recall

Evaluation Campaigns,

beware of your power!

Pick up a measure and then...

Measure becomes standard

because people use it

People use it because it is standard

Popular strategy number 2:

get creative

The Hilton-Chihuahua

metric effect!

vicious circle!

MT evaluation mess

Popular strategy number 1:

use the most popular metric

State of the Art

Popular strategy number 3:

use the simplest

Popular strategy number 4:

use the one that says

what I want to hear

computed per item

BCubed-inspired plus:

+ weight that decreases with ranking depth

+ parameters to establish that the first n documents carry W% of the weight

**Padova, May 8 2014**
