RBU: Rank-Biased Utility (Tsinghua University)

Presentation of the SIGIR'18 paper "An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric" at Tsinghua University (Beijing, China) on Sep 18, 2018. Paper: http://dl.acm.org/citation.cfm?id=3210024
by Damiano Spina on 18 September 2018


Transcript of RBU: Rank-Biased Utility (Tsinghua University)

RBU: Rank-Biased Utility
An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric
Enrique Amigó
Damiano Spina
Jorge Carrillo-de-Albornoz

Components of RBU:
  • effort (as in EU, Expected Utility)
  • patience parameter (as in RBP)
  • evaluation depth k = min(cutoff, |d|)
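Below is a minimal Python sketch of how these components might fit together, assuming an ERR-style per-aspect novelty gain weighted by w(t), discounted by the patience parameter p, and charged a per-document effort e up to depth k; the function name, default values, and data layout are illustrative, and the exact RBU formulation is given in the paper.

# Hypothetical sketch combining the components listed on this slide:
# patience p (as in RBP), effort e (as in EU), cutoff k = min(cutoff, |d|),
# aspect weights w(t), and ERR-style redundancy. Not the paper's exact formula.

def rbu_sketch(ranking, weights, rel, p=0.99, e=0.01, cutoff=100):
    """ranking: list of document ids; weights: {aspect: w(t)};
    rel: {(doc, aspect): graded relevance in [0, 1]}."""
    k = min(cutoff, len(ranking))                   # k = min(cutoff, |d|)
    unsatisfied = {t: 1.0 for t in weights}         # redundancy state per aspect
    score = 0.0
    for i, doc in enumerate(ranking[:k], start=1):
        gain = 0.0
        for t, w in weights.items():
            r = rel.get((doc, t), 0.0)
            gain += w * r * unsatisfied[t]          # only novel (non-redundant) gain counts
            unsatisfied[t] *= (1.0 - r)             # the aspect becomes saturated
        score += (p ** i) * (gain - e)              # patience discount minus inspection effort
    return score

# e.g. rbu_sketch(["d1", "d2"], {"a": 0.6, "b": 0.4},
#                 {("d1", "a"): 1.0, ("d2", "b"): 0.5})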
Relevance-Oriented Constraints
Diversity-Oriented Constraints
Priority
Swapping items in concordance with their relevance increases the ranking quality score.
Deepness
Correctly swapping contiguous items has a greater effect at early ranking positions.
Deepness Threshold
Assuming binary relevance, there exists a value n large enough such that retrieving only one relevant document at the top of the ranking is better than retrieving n relevant documents after n non-relevant documents.
Closeness Threshold
Assuming binary relevance, there exists a value m small enough such that retrieving one relevant document in the first position is worse than retrieving m relevant documents after m non-relevant documents.
Confidence
Adding non-relevant documents decreases the score.
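As a concrete reading of the two threshold constraints, here is a small Python probe on synthetic rankings with binary relevance; the helper names and the fixed n and m are illustrative, and since both constraints are existential, a probe at specific values is indicative rather than a proof.

# Toy probes of the threshold constraints on synthetic binary-relevance rankings
# (1 = relevant, 0 = non-relevant). `metric` maps a list of grades to a score.

def reciprocal_rank(grades):
    return next((1.0 / (i + 1) for i, g in enumerate(grades) if g), 0.0)

def deepness_threshold_probe(metric, n=1000):
    one_on_top = [1] + [0] * n           # one relevant document at the top
    late_block = [0] * n + [1] * n       # n relevant docs after n non-relevant ones
    return metric(one_on_top) > metric(late_block)

def closeness_threshold_probe(metric, m=1):
    one_on_top = [1] + [0] * m           # one relevant document at the top
    near_block = [0] * m + [1] * m       # m relevant docs after m non-relevant ones
    return metric(one_on_top) < metric(near_block)

print(deepness_threshold_probe(reciprocal_rank))   # True: RR prefers the early document
print(closeness_threshold_probe(reciprocal_rank))  # False at m = 1 for RR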
  • Query Aspect Diversity
  • Redundancy
  • Monotonic Redundancy
  • Aspect Relevance Saturation
  • Aspect Relevance
Query Aspect Diversity
Covering more of the aspects in the same document (i.e., without the additional effort of inspecting more documents) increases the score.
Redundancy
Adding a document from a less present (less redundant) aspect increases the score.
Monotonic Redundancy
If an aspect t is captured to a greater extent than a second aspect t' in every previously observed document, then the ranking is more redundant w.r.t. t than w.r.t. t'.
Aspect Relevance Saturation
Adding relevant documents for an aspect that has been 'sufficiently' covered does not necessarily increase the score.
Aspect Relevance
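To make the Redundancy constraint concrete, the small Python check below uses a generic ERR-style novelty gain (not any specific metric from the paper) and verifies that appending a document about a less-covered aspect scores higher than appending yet another document about an already well-covered aspect; the weights and documents are made up.

# Toy illustration of the Redundancy constraint with a simple ERR-style
# novelty gain per aspect, discounted by rank.

def novelty_score(ranking_aspects, weights):
    """ranking_aspects: list of {aspect: relevance} dicts, one per document."""
    unsatisfied = {t: 1.0 for t in weights}
    score = 0.0
    for i, doc in enumerate(ranking_aspects, start=1):
        for t, r in doc.items():
            score += weights[t] * r * unsatisfied[t] / i   # novel gain, rank-discounted
            unsatisfied[t] *= (1.0 - r)
    return score

w = {"a": 0.5, "b": 0.5}
seen = [{"a": 1.0}, {"a": 1.0}]                    # aspect "a" is already well covered
print(novelty_score(seen + [{"b": 1.0}], w) >      # adding a doc about "b" (less present)
      novelty_score(seen + [{"a": 1.0}], w))       # beats adding another "a" doc -> True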
Experiments
Some Results
How do we pick the right metric?
Which metric tells us which system performs better for navigational queries? Reciprocal Rank (RR). And for Search Result Diversification?
Example: for the query 'facebook' with relevant result facebook.com, a ranking with facebook.com in the first position gets RR = 1; with it in the second position, RR = 0.5.
query aspects:
{ , , , }
Pick the metric that tells you what you want to hear
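For the navigational example above, Reciprocal Rank takes only a few lines of Python; the non-relevant result names are invented for illustration.

# RR = 1 / rank of the first relevant document.
def reciprocal_rank(ranking, relevant):
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

relevant = {"facebook.com"}
print(reciprocal_rank(["facebook.com", "some-other-page"], relevant))   # 1.0
print(reciprocal_rank(["some-other-page", "facebook.com"], relevant))   # 0.5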
Axiomatic Analysis
Compare metrics against the satisfaction of users
We may not be able to prescribe how to design the evaluation metric for our task, but we know certain boundary conditions (or constraints) of our problem.
Primary Goal: Analyze and compare existing diversity metrics with respect to formal constraints
TREC Web Track 2014
(ad-hoc retrieval, which includes diversification)
30 official runs, 50 topics
Different sets of metrics and parameters

Metric Unanimity (MU)
Evaluation of Effectiveness of Information Retrieval Systems
[Figure: rankings d1 through d5 returned by System 1 and System 2 for the 'facebook' (facebook.com) example]
desirable properties of a metric for our problem (e.g., search result diversification)
  • aspect weight (*-IA)
  • redundancy (ERR)
  • relevance
Aspects with higher weights have a greater effect on the ranking quality score.
Result: none of the existing metrics simultaneously satisfies all 10 defined constraints
Metric Unanimity
Meta-Evaluation
How can we quantify the ability of metrics to capture relevance and diversity quality characteristics?



Metrics capture different quality criteria.

If a system is better than another system on every quality criterion, no metric in a set of metrics should say the opposite.
Given a set of metrics M and a metric m, MU(m, M) is defined as the pointwise mutual information between two events: 'all metrics in M simultaneously say that s1 is better than or equal to s2' and 'metric m says that s1 is better than or equal to s2' (more in the paper).
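One possible Python sketch of this computation, estimating the pointwise mutual information over ordered pairs of systems; the data layout (scores[metric][system]) and the absence of smoothing for zero counts are assumptions, and the paper gives the exact estimation and the leave-one-out protocol.

import math
from itertools import permutations

def metric_unanimity(m, M, scores, systems):
    """scores[metric][system] -> score. MU(m, M) as pointwise mutual information
    between "all metrics in M say s1 >= s2" and "metric m says s1 >= s2"."""
    pairs = list(permutations(systems, 2))
    both = unanimous = m_agrees = 0
    for s1, s2 in pairs:
        all_say = all(scores[x][s1] >= scores[x][s2] for x in M)
        m_says = scores[m][s1] >= scores[m][s2]
        unanimous += all_say
        m_agrees += m_says
        both += all_say and m_says
    n = len(pairs)
    # PMI = log P(both) - log P(all say) - log P(m says); assumes non-zero counts.
    return math.log((both / n) / ((unanimous / n) * (m_agrees / n)))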
[Table: MU score per metric]
Metrics that satisfy only a few constraints have lower MU scores than the rest of the metrics (for each metric, the variant with the highest score is reported)
Summary
Which evaluation metric should be used?
Future Work
Further parameter sensitivity analysis of metrics
[Example: for the given query aspects, System 1's ranking (d1, d2, d3, d4) is better than System 2's (d2, d1, d3, d4)]
RBU obtains the highest scores when p = 0.99 and the effort component e > 0
Higher values of the patience parameter p in RBP obtain higher MU scores
Shallower metrics (e.g., early cutoffs) tend to have lower MU scores (fewer quality criteria are captured)

None of the existing metrics satisfies all the constraints
The proposed metric, Rank-Biased Utility (RBU), satisfies all the formal constraints
Experiments: RBU captures more quality criteria than the other metrics
Meta-evaluate other criteria, such as sensitivity or robustness against noise
Can metrics that satisfy some constraints be combined automatically to satisfy them all?
query: things to do in Beijing
aspects: with family, at night, adventures, to eat
d1 is more relevant for more aspects
Notation: a set of aspects, where each aspect t has a weight w(t), and a ranking d
A metric that captures all quality criteria should reflect these improvements.
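A toy Python encoding of this notation (the weights and relevance grades below are invented for illustration) that could be fed to a sketch like the one shown earlier:

# Aspects of the query "things to do in Beijing", each with a weight w(t),
# plus made-up graded relevance per (document, aspect) and a ranking d.
weights = {"with family": 0.4, "at night": 0.3, "adventures": 0.2, "to eat": 0.1}
rel = {
    ("d1", "with family"): 1.0, ("d1", "at night"): 0.5, ("d1", "to eat"): 0.5,
    ("d2", "with family"): 1.0,   # d2 only covers an aspect d1 already covered
}
ranking = ["d1", "d2"]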
Repeat the process (leave-one-out)
Axiomatic framework to analyze diversity metrics
Diversity metrics (including RBU) available in EvALL: http://evall.uned.es
Damiano
Enrique
Jorge
'Rank-Biased Utility' by the Funky Metrics Band
Example
Tsinghua University, Beijing, China
September 18, 2018
ISAR RMIT Research group
www.rmit-ir.org
RMIT University, Melbourne, Australia
[Example: for the given query aspects, System 1's ranking (d1, d2) is better than System 2's (d1, d3) because d2 is relevant for an unseen aspect]