Three Way Search Engine Queries with Multi-feature Document Comparison

PAN 2012 presentation (see pan.webis.de)
by Jan Kasprzak on 18 September 2013

Transcript of Three Way Search Engine Queries with Multi-feature Document Comparison

Three Way Search Engine Queries
with
Multi-feature
Document Comparison
Šimon Suchomel
and
Jan Kasprzak
Faculty of Informatics
Masaryk University
Three types of queries
Detailed Comparison
Keywords
Intrinsic plagiarism
Headers
Algorithm
Performance
Conclusions
Multiple types
of features

Tokenization
Common
Features
Valid intervals
Postprocessing
KW extraction based on tf-idf
KW enriched with collocations
non-positionable
unconditionally executed
4 KW queries per doc on average
covering the whole document
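
A minimal sketch of tf-idf keyword ranking, to make the keyword-query idea concrete. It is written in Python for illustration (the slides state the actual implementation is pure Perl). The function names, the tiny reference collection, and the exact tf-idf variant are assumptions, not details taken from the presentation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Words are maximal runs of Unicode letters, lowercased.
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]


def tfidf_keywords(document, reference_docs, top_n=5):
    """Rank the words of `document` by tf-idf against `reference_docs`.

    `reference_docs` is a hypothetical background collection; the
    document itself is counted as one more document for the idf part.
    """
    tf = Counter(tokenize(document))
    n_docs = len(reference_docs) + 1
    doc_sets = [set(tokenize(d)) for d in reference_docs]
    scores = {}
    for word, count in tf.items():
        df = 1 + sum(1 for s in doc_sets if word in s)
        scores[word] = count * math.log(n_docs / df)
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```

The top-ranked words could then be grouped into a handful of queries per document; how the collocation enrichment mentioned on the slide works is not shown here.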
41 minutes
14 seconds!
Purely based on
(offset, length)

Suspicious document
All About Cell Phone

More and more people have been using cell phone nowadays. In fact, some
people even consider it as a necessity and we can't blame them. Cell phone
is very useful and helpful for many after all.

What is Cell Phone?
Unless you're living under a rock, you do know what a cell phone is. But
for the uninitiated, here's what it means. A mobile phone (also known as a
...
Differences Between Cell Phones

The cell phone itself isn't an ongoing expense. You will need to shell out
money for this only once unless you want an upgrade. Your own personal
taste, style and need should determine what cell phone you want (not the
sales assistant behind the desk).
...
The use of "hands-free" was not recommended by the British Consumers'
Association in a statement in November 2000 as they believed that exposure
was increased. However, measurements for the (then) UK Department of Trade
and Industry and others for the French l'Agence française de sécurité
sanitaire environmental showed substantial reductions. In 2005 Professor
Lawrie Challis and others said clipping a ferrite bead onto hands-free kits
stops the radio waves travelling up the wire and into the head.

Overall Health Risks
Many scientific studies have investigated possible health effects of mobile
phone radiations. These studies are occasionally reviewed by some
scientific committees to assess overall risks. The most recent assessment
was published in 2007 by the European Commission Scientific Committee on
Emerging and Newly Identified Health Risks (SCENIHR). It concludes from the
available research that no significant health effect has been demonstrated
from mobile phone radiation at normal exposure levels:
Sequences of
Unicode letters
Words
24-core server
4x AMD 8139
e.g.: mobile phone signal prepaid carriers
Word
5-grams
Stopword
8-grams
What do you get if you multiply six by nine?
3+ characters,
sorted
get-multiply-what-you-you
get-multiply-six-you-you
get-multiply-nine-six-you
42
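
The sorted word 5-gram example above can be reproduced with a short Python sketch (Python is used for illustration only; the slides describe a pure Perl implementation). It assumes tokens are runs of Unicode letters, keeps only words of 3+ characters, sorts the words inside each 5-gram, and records each feature's (offset, length) in the text. The function names and the tuple layout are hypothetical.

```python
import re


def tokens_with_offsets(text):
    # Each token is (lowercased word, offset, length);
    # a word is a maximal run of Unicode letters.
    return [(m.group().lower(), m.start(), len(m.group()))
            for m in re.finditer(r"[^\W\d_]+", text)]


def word_ngrams(text, n=5, min_len=3):
    """Return (key, offset, length) for every sorted word n-gram."""
    words = [t for t in tokens_with_offsets(text) if t[2] >= min_len]
    features = []
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        # Sorting the words makes the feature insensitive to word order.
        key = "-".join(sorted(w for w, _, _ in window))
        offset = window[0][1]
        _, last_off, last_len = window[-1]
        features.append((key, offset, last_off + last_len - offset))
    return features


if __name__ == "__main__":
    for feature in word_ngrams("What do you get if you multiply six by nine?"):
        print(feature)
```

Short words such as "do", "if" and "by" are dropped before the 5-grams are formed, which is why the first feature spans the text from "What" through "multiply".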
[Stamatatos, 2011]
50 most-frequent
English words
unsorted
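
A comparable Python sketch for stopword 8-grams, under the same assumptions as for the word features: tokens are runs of Unicode letters, and each feature carries its (offset, length). The stopword list is the 50-word list shown in this presentation; the function name and tuple layout are hypothetical. Note that with letters-only tokenization the entries "n't" and "'s" can never match, so this is a simplification.

```python
import re

# The 50 most frequent English words as listed on the slides.
STOPWORDS = frozenset("""
n't the of and a in to is was it for with he be on i that by at you 's are
not his this from but had which she they or an were we their been has have
will would her there can all as if who what said
""".split())


def stopword_ngrams(text, n=8):
    """Return (key, offset, length) for every stopword n-gram."""
    hits = [(m.group().lower(), m.start(), len(m.group()))
            for m in re.finditer(r"[^\W\d_]+", text)
            if m.group().lower() in STOPWORDS]
    features = []
    for i in range(len(hits) - n + 1):
        window = hits[i:i + n]
        # Unlike the word 5-grams, the original order is kept (unsorted).
        key = "-".join(w for w, _, _ in window)
        offset = window[0][1]
        _, last_off, last_len = window[-1]
        features.append((key, offset, last_off + last_len - offset))
    return features
```

Because only stopwords are kept, a single 8-gram usually spans considerably more than eight words of running text.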
e.g.: in professor lawrie challis clipping ferrite
e.g.: Differences Between Cell Phones
positionable
conditionally executed
covering the suspicious passage
based on Average Word Frequency Class
its change indicates a suspicious passage
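
A rough Python sketch of the Average Word Frequency Class idea. It uses the common definition in which a word's frequency class is floor(log2(f(w*)/f(w))), with w* the most frequent word of a reference corpus. The frequency table, the treatment of unknown words, the chunking, and the threshold are all assumptions made for illustration; the slides do not specify them.

```python
import math
import re


def word_frequency_class(word, corpus_freq):
    # Class 0 is the most frequent corpus word; rarer words get higher classes.
    # Assumption: words missing from the table are treated as seen once.
    top = max(corpus_freq.values())
    return math.floor(math.log2(top / corpus_freq.get(word, 1)))


def awfc(text, corpus_freq):
    """Average word frequency class of all words in `text`."""
    words = [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    if not words:
        return 0.0
    return sum(word_frequency_class(w, corpus_freq) for w in words) / len(words)


def suspicious_chunks(chunks, corpus_freq, threshold=2.0):
    """Indices of chunks whose AWFC jumps by more than `threshold`
    (a hypothetical value) relative to the preceding chunk."""
    values = [awfc(c, corpus_freq) for c in chunks]
    return [i for i in range(1, len(values))
            if abs(values[i] - values[i - 1]) > threshold]
```

A chunk flagged this way would then be the passage that a positionable query is built from.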
naive headers detection
positionable
conditionally executed
covering the part introduced by the header
No strict ordering
Ordered features
suspicious
source
suspicious
source
Identify by (offset, length)!
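
One way to read "identify by (offset, length)" is sketched below in Python: every feature, whatever its type, is a (key, offset, length) tuple, and common features are found by matching keys between the suspicious and the source document. The function names and the tuple layout are hypothetical; this is not the authors' Perl code.

```python
from collections import defaultdict


def index_features(features):
    # Map each feature key to all (offset, length) positions where it occurs.
    index = defaultdict(list)
    for key, offset, length in features:
        index[key].append((offset, length))
    return index


def common_features(susp_features, src_features):
    """Return sorted pairs ((susp_offset, susp_length), (src_offset, src_length))
    for every feature key present in both documents."""
    src_index = index_features(src_features)
    pairs = []
    for key, offset, length in susp_features:
        for src_pos in src_index.get(key, []):
            pairs.append(((offset, length), src_pos))
    return sorted(pairs)
```

Because positions are character ranges rather than ordinal feature numbers, features of different types can presumably be mixed in one list and still be compared by where they sit in each document.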
4


6
of common features
Overlapping
detections
Neighbouring
detections
Heuristics for merging
Up to 30,000 characters gap
biased
for recall
> 300 characters?
NO
YES
Drop both
Keep the longer one
Narrowed search
KW, Intrinsic, Headers
with appropriate targeting and combination

word 5-gram
stopword 8-gram
Standard approach
that's
several
pages!
[Kasprzak, 2009]
[Stamatatos, 2011]
Implementation
Pure Perl
669 lines of code
Not by an ordinal number
n't the of and a in to is was it for with he be on i that by at you 's are not his this from but had which she they or an were we their been has have will would her there can all as if who what said
O(# of documents)
O(# of document pairs)
with Michal Brandejs