ICMR 2013

Retrieving geo-location of videos with a divide & conquer hierarchical multimodal approach
by Michele Trevisiol on 8 May 2014

Transcript of ICMR 2013

Photo Credit:
NASA Earth Observatory/NOAA NGDC

Retrieving geo-location of videos with a divide & conquer hierarchical multimodal approach

Michele Trevisiol
UPF, Spain

Jonathan Delhumeau
INRIA, France

Hervé Jégou
INRIA, France

Guillaume Gravier
IRISA, France

ICMR 2013, April 16 - 19, Dallas, Texas (USA)

Future Work & Conclusions
Tags Weighting
Our Tag Filtering
Tags Pre-Processing
numeric tags (e.g. date, year)
numeric characters from the alphanumeric tags
stop-words dictionary:
common words (travel, geotag, birthday, etc.)
camera/device (iPhone, camera, Canon, etc.)
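The filters listed above can be sketched roughly as follows. This is only an illustration: the names `COMMON_WORDS`, `DEVICE_WORDS` and `preprocess_tags` are hypothetical, the word lists contain just the examples given on the slide, and lowercase matching is an assumption.

```python
import re

# Illustrative word lists: only the examples named on the slide.
COMMON_WORDS = {"travel", "geotag", "birthday"}
DEVICE_WORDS = {"iphone", "camera", "canon"}


def preprocess_tags(tags):
    """Return the tags that survive the filters, in their original order."""
    blocked = COMMON_WORDS | DEVICE_WORDS
    kept = []
    for tag in tags:
        tag = tag.lower()
        if tag.isdigit():
            continue  # purely numeric tag such as a date or a year
        tag = re.sub(r"\d+", "", tag)  # strip digits from alphanumeric tags
        if tag and tag not in blocked:
            kept.append(tag)
    return kept


print(preprocess_tags(["2010", "trip2010", "iPhone", "barcelona", "travel"]))
# ['trip', 'barcelona']
```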
J. Whissell and C. Clarke, Improving document clustering using Okapi BM25 feature weighting. Information Retrieval, 14, 2011.
About Okapi BM25
Following Whissell and Clarke, we experimented with different values of k for both our steps.
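For reference, a minimal sketch of BM25 weighting applied to raw tag counts. It uses a common textbook formulation, not necessarily the exact variant used in this work. The function name `bm25_weights`, treating each grid cell as a "document", and the non-negative IDF variant are all assumptions; `k1` plays the role of the parameter k mentioned above.

```python
import math


def bm25_weights(counts, k1=1.2, b=0.75):
    """counts: {doc_id: {term: raw_count}} -> {doc_id: {term: bm25_weight}}."""
    n_docs = len(counts)
    doc_len = {d: sum(terms.values()) for d, terms in counts.items()}
    avg_len = sum(doc_len.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = {}
    for terms in counts.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1
    weights = {}
    for d, terms in counts.items():
        # Length normalisation: longer documents are penalised when b > 0.
        norm = k1 * (1 - b + b * doc_len[d] / avg_len)
        weights[d] = {}
        for term, tf in terms.items():
            # The "1 +" inside the log keeps the IDF non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            weights[d][term] = idf * tf * (k1 + 1) / (tf + norm)
    return weights


# Toy counts: "valencia" is rarer and more frequent in cell_a than "beach".
counts = {"cell_a": {"valencia": 3, "beach": 1}, "cell_b": {"beach": 2}}
weights = bm25_weights(counts)
```

In this toy input, `valencia` receives a higher weight than `beach` in `cell_a`, because it appears in fewer cells and more often within that cell.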
IR-Matrix Method
we quantize the coordinates on a grid of 0.1° cells
e.g. (43.71268, 10.4148) belongs to (43.7, 10.4)
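A small sketch of this quantization step. The name `to_cell` is hypothetical, and flooring is an assumption: the slide's example is equally consistent with rounding to one decimal place.

```python
import math

CELLS_PER_DEGREE = 10  # 0.1 degree cells, as on the slide


def to_cell(lat, lon):
    """Map a coordinate to the south-west corner of its 0.1 degree cell."""
    # Work with integer cell indices, then convert back to degrees.
    return (math.floor(lat * CELLS_PER_DEGREE) / CELLS_PER_DEGREE,
            math.floor(lon * CELLS_PER_DEGREE) / CELLS_PER_DEGREE)


print(to_cell(43.71268, 10.4148))  # (43.7, 10.4)
```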
Given a test video with tags
we compute a weighted co-occurrence matrix (tags × cells), associating each tag with the grid cells it appears in
Divide & Conquer Approach
we determine which grid cell the test video is likely to belong to
then, we consider all the coordinates of all the previous videos/images located in that area (or areas)
we determine which coordinates share the most related tags with the test video
we return those coordinates (if more than one, we apply the "medoid")
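The medoid step can be illustrated as below: among several candidate coordinates, pick the one with the smallest total distance to all the others. Using the Haversine distance here is an assumption (the slides only name it for evaluation), and `haversine_km` and `medoid` are hypothetical names.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def medoid(points):
    """Return the point whose summed distance to all points is smallest."""
    return min(points, key=lambda p: sum(haversine_km(p, q) for q in points))
```

Unlike a mean of latitudes and longitudes, the medoid is always one of the candidate coordinates, so a single far-away outlier cannot drag the answer to a place where no media was taken.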
we find the best cell(s)
each query vector is multiplied by the weighted tags-cells matrix to find the most probable cell(s)
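A sketch of this matrix product with a sparse dictionary layout. The query is treated as a binary vector over tags, which is an assumption; `best_cells`, the data layout and the toy weights are hypothetical.

```python
from collections import defaultdict


def best_cells(query_tags, tag_cell_weights):
    """Return the cell(s) with the highest summed weight for the query tags.

    tag_cell_weights: {tag: {cell: weight}}, a sparse tags-cells matrix.
    """
    scores = defaultdict(float)
    for tag in set(query_tags):
        for cell, weight in tag_cell_weights.get(tag, {}).items():
            scores[cell] += weight
    if not scores:
        return []  # none of the query tags was seen in training
    top = max(scores.values())
    return [cell for cell, score in scores.items() if score == top]


# Toy weights, invented for illustration only.
matrix = {
    "spain": {(39.4, -0.4): 0.5, (40.4, -3.8): 0.5},
    "valencia": {(39.4, -0.4): 0.9},
    "sunset": {(34.0, -118.5): 0.2},
}
print(best_cells(["spain", "valencia", "sunset"], matrix))  # [(39.4, -0.4)]
```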
weights process
Okapi BM25 feature weighting
smoothing with signed SQRT
normalized (L2-norm)
redundancy reduction (whitening)
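Two of these steps, the signed square root and the L2 normalization, are simple enough to sketch on a single matrix row. The whitening step is left out because it needs a PCA over the whole matrix. The function names are hypothetical.

```python
import math


def signed_sqrt(vector):
    """Apply sign(x) * sqrt(|x|) to every component."""
    return [math.copysign(math.sqrt(abs(x)), x) for x in vector]


def l2_normalize(vector):
    """Scale to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


row = l2_normalize(signed_sqrt([9.0, -16.0, 0.0]))
print(row)  # [0.6, -0.8, 0.0]
```

The square root dampens very large weights, so a few dominant tags contribute less to the final score.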
tags-coordinates matrix
Each query vector is multiplied by the weighted tags-coordinates matrix to find the most probable coordinates (latitude, longitude)
Tags Weighting
identify which tags are more "geo-descriptive"
[ tag | frequency | avg_distance ], sorted by frequency
tag
frequency (no. of times that it appears in the whole training set)
average distance (avg. distance among all the coordinates associated)
Future Work
generalize the proposed tags weighting scheme
(defining some specific criteria)
investigate alternative methods to group the coordinates in areas for the 1st step
(e.g., clustering)
integrate external information
(gazetteers, Wordnet, API of GMaps, etc.)
Weighting Scheme
The following heuristic is used to identify how geo-descriptive a tag is:
frequency > 200
10 < avgDistance < 50
Example of tags with the top weight: all the tags vs. the tags that respect the constraints
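The heuristic above can be sketched as follows. Several points are assumptions: the average distance is read as the mean pairwise Haversine distance, the unit is taken to be kilometres, and the names `tag_statistics` and `geo_descriptive` are hypothetical.

```python
import math
from itertools import combinations


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def tag_statistics(tag_coords):
    """{tag: [(lat, lon), ...]} -> {tag: (frequency, avg_distance_km)}."""
    stats = {}
    for tag, coords in tag_coords.items():
        pairs = list(combinations(coords, 2))
        avg = sum(haversine_km(p, q) for p, q in pairs) / len(pairs) if pairs else 0.0
        stats[tag] = (len(coords), avg)
    return stats


def geo_descriptive(stats, min_freq=200, min_dist=10.0, max_dist=50.0):
    """Tags with frequency > min_freq and min_dist < avg distance < max_dist."""
    return {tag for tag, (freq, avg) in stats.items()
            if freq > min_freq and min_dist < avg < max_dist}


# Invented statistics, for illustration only.
example = {"valencia": (350, 22.0), "nature": (9000, 4200.0), "mysite": (3, 0.0)}
print(geo_descriptive(example))  # {'valencia'}
```

The pairwise average is quadratic in the number of coordinates per tag, so a real training set of this size would likely need sampling or a cheaper spread estimate.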
On the training set:
For each selected cell:
Which sources can we exploit?
Introduction
Given a set of Flickr videos.
What we do: determine the location (latitude/longitude)
Hierarchical and Multimodal Approach
Overview
Datasets
We did not use any external information such as GeoNames, WordNet, Wikipedia, Google Maps, etc.
Geo Tagging
Why?
text annotations: i.e., title, description, tags
user's information: i.e., hometown, previously uploaded media
user's social contacts and their uploaded media
last but not least: visual information from the image/video.
Datasets (by MediaEval Placing Task 2012)
Organized by Adam Rae and Pascal Kelm
train set: images (~3.2M) and videos (~15K)
test set: 4182 videos
Evaluations
Great-circle distances (Haversine) between estimated and real location,
i.e., how many videos are located within 1 km, 10 km, 100 km, 1000 km, and 10000 km.
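A sketch of this evaluation: compute the Haversine error for each video, then report the fraction of videos within each radius. The function names are hypothetical, the Earth radius of 6371 km is the usual mean value, and counting an error exactly equal to the radius as "within" is an assumption.

```python
import math

THRESHOLDS_KM = (1, 10, 100, 1000, 10000)


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def accuracy_at_thresholds(estimated, truth):
    """Fraction of videos whose estimate lies within each radius of the truth."""
    errors = [haversine_km(e, t) for e, t in zip(estimated, truth)]
    return {r: sum(err <= r for err in errors) / len(errors)
            for r in THRESHOLDS_KM}


# Two toy videos: one exact hit, one about 111 km off.
result = accuracy_at_thresholds([(0.0, 0.0), (0.0, 0.0)],
                                [(0.0, 0.0), (0.0, 1.0)])
print(result)  # {1: 0.5, 10: 0.5, 100: 0.5, 1000: 1.0, 10000: 1.0}
```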

Ground truth is supplied by Flickr users at upload time.
Presence of tags before our tags pre-processing:

Train Set: images (3.2M) + videos (15K)
14.2% -> no tags
0.9% -> one tag only

Test Set: videos (4182)
45.5% -> no tags
3.3% -> one tag only
"beach"
"nature"
"iphone"
"california"
"new york"
"italy"
Flickr Tag Normalization
input: "the Sagrada Familia, Barcelona, trip 2010"
output: "sagradafamilia barcelona trip2010"
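One way to reproduce this input/output pair is sketched below. The rules are inferred from the single example, so they are assumptions rather than Flickr's documented behaviour: split on commas, lowercase, drop a stop word, and join the remaining words of each tag. `STOP_WORDS` and `normalize_tags` are hypothetical names.

```python
import re

STOP_WORDS = {"the"}  # inferred from the example; a real list would be longer


def normalize_tags(raw):
    """Turn a comma-separated tag string into space-separated compact tags."""
    tags = []
    for chunk in raw.split(","):
        # Keep ASCII letters and digits only; accented letters are dropped.
        words = [w for w in re.findall(r"[a-z0-9]+", chunk.lower())
                 if w not in STOP_WORDS]
        if words:
            tags.append("".join(words))
    return " ".join(tags)


print(normalize_tags("the Sagrada Familia, Barcelona, trip 2010"))
# sagradafamilia barcelona trip2010
```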
H. Jégou and O. Chum, Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening. ECCV, Oct. 2012.
If there are no tags?
In cascade, we try different solutions:
1) user's upload history
based on the user's most common location: the most frequent location of the media they uploaded (present in the training set)
2) social network extension
based on the most common location of the media uploaded by the user's social connections
3) user's hometown
defined by the user in textual format, e.g., "San Francisco, California", "New York", etc.
4) content-based approach
5) global prior-location
fixed location computed 'a priori' among all the coordinates in the training set
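The cascade can be sketched as a list of estimators tried in order, with the global prior as the last resort. Everything named here is hypothetical: the dictionary fields, the function names, and the placeholder prior. The hometown is assumed to be already converted from text to coordinates.

```python
from collections import Counter


def most_frequent_location(locations):
    """Most common (lat, lon) in a list, or None if the list is empty."""
    return Counter(locations).most_common(1)[0][0] if locations else None


def locate_without_tags(video, estimators, global_prior):
    """Return the first estimate that is not None, else the global prior."""
    for estimate in estimators:
        location = estimate(video)
        if location is not None:
            return location
    return global_prior


# Hypothetical estimators; each reads one field of a dict describing the video.
estimators = [
    lambda v: most_frequent_location(v.get("user_uploads", [])),      # 1)
    lambda v: most_frequent_location(v.get("contacts_uploads", [])),  # 2)
    lambda v: v.get("hometown_coords"),                               # 3)
    lambda v: v.get("visual_match"),                                  # 4)
]
GLOBAL_PRIOR = (40.7, -74.0)  # placeholder value, not taken from the slides

video = {"hometown_coords": (37.8, -122.4)}
print(locate_without_tags(video, estimators, GLOBAL_PRIOR))  # (37.8, -122.4)
```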
H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. PAMI, Sep. 2012.
many indoor scenes, not enough matches for all the possible locations, etc.
standard Okapi BM25 values: k in [1.2, 2.0] and b = 0.75
Cumulative values of correctly detected locations for the methods in the pipeline: percentage of videos found (y-axis) within a radius of x km (x-axis).
Final Evaluation
Results
Cumulative correctly detected locations: percentage of correctly detected locations (y-axis) within a radius of x km (x-axis).
We compare our IR-Matrix Pipeline with:
The best participants of MediaEval 2012:
Conclusions
we could not exploit the visual information enough, due to the type of the data
we found that tags are important, but user's information surprised us! (knowledge retrieved from upload history, hometown, etc.)
Example of tags weights
Related Work in the Community
Extending the knowledge around the media object:
improve media search,
classification,
location-specific information,
etc.
Estimating the geographic scope of a piece of text only (tweets, queries)

Given some words (e.g., tags), determine which of them are geo-descriptive
Those are images with tags taken from the training set
IR-Matrix Method (briefly)
Divide & Conquer Approach
1) we select the best matching grid cell
2) we select the coordinates that are most likely associated with the input tags
Given the tags of the input media
spain valencia canon40D
sunset mysite spirit photography
tags weighting
select the grid cell with the highest probability to contain the given tags
(39.45629, -0.35199)
matched tags: spain, valencia, science
Our Baseline: IR-Frequency
Our first attempt to exploit the tags.

Same pipeline and multimodal approach as our method.
simpler tags spreading filtering (discarding tags)
each train image/video is a document
given a test video -> get the train document with the highest number of common tags (priority to machine tags)
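The core of this baseline, retrieving the training document that shares the most tags with the test video, can be sketched as below. The priority given to machine tags is not modelled, and `nearest_by_common_tags` and the toy documents are hypothetical.

```python
def nearest_by_common_tags(test_tags, train_docs):
    """Return the location of the training document sharing the most tags.

    train_docs: iterable of (tags, (lat, lon)). Returns None if no document
    shares a tag with the query; ties go to the earliest document.
    """
    query = set(test_tags)
    best_location, best_overlap = None, 0
    for tags, location in train_docs:
        overlap = len(query & set(tags))
        if overlap > best_overlap:
            best_location, best_overlap = location, overlap
    return best_location


# Toy training documents, invented for illustration.
train = [
    (["spain", "madrid"], (40.4, -3.7)),
    (["spain", "valencia", "beach"], (39.5, -0.4)),
]
print(nearest_by_common_tags(["spain", "valencia", "sunset"], train))
# (39.5, -0.4)
```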
IR-Frequency Pipeline (our baseline)
Berlin (BN): P. Kelm, S. Schmiedeke, T. Sikora
Unicamp (UP): L. Tzy Li, J. Almeida, D. Carlos G. Pedronette, O. Penatti, R.S. Torres
Berkeley (BY): J. Choi, V. Ekambaram, G. Friedland, K. Ramchandran
CEA/List (CL): A. Popescu, N. Ballas
Ghent/Cardiff (GC): O. Van Laere, S. Schockaert, J. Quinn, F. Langbein, B. Dhoedt
"Nearest Neighbors approach based on text (tags)"
Cumulative values of correctly detected locations: percentage of videos found (y-axis) within a radius of x km (x-axis).
MediaEval Workshop
Placing Task's works
Hays and Efros, CVPR, 2008.
Purely visual approach
Penatti et al., ICMR, 2012.
Bag of scenes, saving semantic information
Xirong Li et al., ICMR, 2012.
Visual concept detection + geo context
Serdyukov et al., SIGIR, 2009.
Cell Grids, language model, neighbors influence
O'Hare and Murdock, Information Retrieval, 2012.
Cell Grid, language models based approach
Sergieh et al., ICMR, 2012.
Relevant tag estimation
Crandall et al., WWW, 2009.
Visual + Text, two levels of granularity (city/landmark)
Outcome: visual approach alone is not reliable enough. The text needs to be exploited.
About 60% of the test videos do not contain tags after our pre-processing.
Given the tags of the input media
tags-coordinates matrix
we create a weighted co-occurrence matrix associating each tag with the coordinates it belongs to
"Near Neighbor approach based on text"
About 60% of the test videos do not contain tags after our tags filtering.