Plagiarism Detection Service (Υπηρεσία Εντοπισμού Λογοκλοπής) – 9.14

by Dimitris Antonakis on 24 January 2014


Transcript of Plagiarism Detection Service – 9.14

Dimitris Gavrilis
Dimitris Antonakis
Part 1 - Introduction

Overview
Detect plagiarism in grey literature and publications from the Greek academic community
Gather and index all theses from member libraries
Establish a framework for integrating new content, checking for plagiarism, etc.
Integrate into each library’s deposit workflows

Challenges
Integrate the system into academic library software
Simplify the process
Handle a large amount of content
Process each request quickly

Access
Web-based UI
REST API
OAI-PMH harvester
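
To illustrate the harvester access route, here is a minimal sketch of an OAI-PMH ListRecords request and response parser. The verb, the oai_dc metadata prefix, the resumptionToken element and the XML namespace come from the OAI-PMH 2.0 protocol. The base URL, the sample response and the function names are assumptions made for illustration; the slides do not describe the service's actual endpoints.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (base_url is a hypothetical repository)."""
    if resumption_token:
        # A resumption token replaces all other arguments on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Invented sample response, trimmed to the elements the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:thesis-1</identifier></header></record>
    <record><header><identifier>oai:example.org:thesis-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://repository.example.org/oai"))
    print(parse_list_records(SAMPLE))
```

A harvester would repeat the request with each returned token until no token comes back, then pass the harvested theses on for text extraction and indexing.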

Part 2 – System Details
[Figure: scan time (ms) per window size and step size]

Part 3 – Matching Algorithm
Part 4 – Experimental Results
Overall Process
Text extraction problems
Text cannot be extracted from documents because:
The author has copy-protected the document
The fonts are not exported
The document contains scanned images of text
The document is corrupted
The document contains special characters that cause the text-extraction application to crash

Successful text extraction
Sample of 147 documents
-Various departments
-Various users (undergraduates, postgraduates, etc.)

20 documents could not be indexed
-Miss ratio: 13.6%

Normally these documents could not be checked

Big documents
Typical thesis:
Over 110 pages
1.2 Mbytes (PDF)
0.5 Mbytes (plain text)
0.35 Mbytes (indexed; see later)

Typical example for an average department:
When cross-searching across 2,000 documents
2,000 x 0.35 Mbytes = 0.7 Gbytes of raw text per search
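
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the figures are the ones quoted on the slide, the function name is hypothetical, and 1 Gbyte is assumed to be 1,000 Mbytes.

```python
# Figures quoted on the slide.
INDEXED_MB_PER_THESIS = 0.35
DOCUMENTS_PER_DEPARTMENT = 2000


def scan_volume_gb(documents, mb_per_document):
    """Text volume scanned by one cross-search, in Gbytes (1 Gbyte = 1000 Mbytes assumed)."""
    return documents * mb_per_document / 1000


if __name__ == "__main__":
    volume = scan_volume_gb(DOCUMENTS_PER_DEPARTMENT, INDEXED_MB_PER_THESIS)
    print(f"{volume:.2f} Gbytes per search")
```

The volume grows linearly with the number of documents, which is why the speed of each comparison matters.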

Approximate string matching
Levenshtein distance
Hamming distance
Episode distance
Longest common subsequence distance
Suffix Trees
Suffix Automata
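
For reference, the first two distances in this list can be sketched in a few lines. These are the textbook definitions, not code from the service, and the function names are chosen here for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(hamming("karolin", "kathrin"))
```

Levenshtein distance costs time proportional to the product of the two string lengths, which becomes expensive when whole theses are compared pairwise.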

Matching Algorithm
Window of N characters in length
Iterate window through text with a step of K characters
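
A minimal sketch of the windowing idea, assuming that each window of the submitted text is looked up as an exact substring of a candidate document. The slides give only the window length N and the step K; the function names, the default values and the substring lookup are assumptions.

```python
def windows(text, n, k):
    """Yield (start, window) pairs: windows of n characters taken every k characters."""
    for start in range(0, len(text) - n + 1, k):
        yield start, text[start:start + n]


def find_matches(query, candidate, n=40, k=10):
    """Start offsets of the query windows that occur verbatim in the candidate."""
    return [start for start, window in windows(query, n, k) if window in candidate]


if __name__ == "__main__":
    # Tiny example with n=4, k=2 so the windows are easy to read.
    print(list(windows("abcdefghij", 4, 2)))
    print(find_matches("abcdefghij", "zzcdefghzz", n=4, k=2))
```

A smaller step K checks more windows and so costs more time per scan, while a larger window N requires a longer verbatim run before a match is reported.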

Implemented Algorithm

Advantages
Very fast
-Allows quick searching across a large number of documents

Language independent
-(e.g. can detect phrases that contain both Greek and English)

Disadvantages
Can only work with exact matches
-No flexibility for slight text variations

Allowing for flexibility
Perform match not directly on text but on an index created from text
-Decrease information provided by “ambiguous” content
-Remove content that could be considered “noise”
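
One way such an index could be built is sketched below. The slides do not say which content counts as "ambiguous" or as "noise", so every rule here is an assumption: accents and letter case are treated as ambiguous, and punctuation and whitespace as noise. The function name is hypothetical.

```python
import unicodedata


def build_index(text):
    """Reduce text to a normalized character stream used only for matching."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep letters and digits only: this drops the combining marks (accents),
    # punctuation and whitespace. casefold() removes case differences and
    # maps the Greek final sigma to the regular sigma.
    return "".join(ch.casefold() for ch in decomposed if ch.isalnum())


if __name__ == "__main__":
    print(build_index("Λογοκλοπής, 2014!"))
    print(build_index("ΛΟΓΟΚΛΟΠΗΣ 2014"))
```

With an index of this kind, two passages that differ only in accents, capitalization, spacing or punctuation produce the same character stream, so an exact-match search can still find them.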

Example #1
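Το ερώτημα της Butler σχετικά με την άμεση σύνδεση του σώματος και του κοινωνικού φύλου φέρνει στο προσκήνιο επιπλέον προβληματισμούς που αφορούν από την μία την βιοπολιτική του σώματος και τις πρακτικές πειθάρχησής του, και από την άλλη, τον τρόπο που δημιουργούνται και φιλτράρονται σε μια κοινωνία οι πληροφορίες και οι ιδιότητες που φέρει το υποκείμενο.
(English translation: Butler's question about the direct connection between the body and gender brings to the fore further concerns relating, on the one hand, to the biopolitics of the body and the practices of disciplining it and, on the other, to the way in which the information and attributes carried by the subject are created and filtered in a society.)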

56 words
Copy-pasted
Example #2
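Με το νέο πλαίσιο προβλέπεται ότι σε περίπτωση που οι σχηματισθείσες προβλέψεις υπολείπονται των αναμενόμενων ζημιών, με βάση την προσέγγιση των εσωτερικών συστημάτων διαβάθμισης (‘IRB’), το ποσό αφαιρείται από τα κύρια στοιχεία των βασικών ιδίων κεφαλαίων.
(English translation: The new framework provides that, where the provisions formed fall short of the expected losses under the internal ratings-based ('IRB') approach, the amount is deducted from the core elements of basic own funds.)

36 words
Minor paraphrasing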

Example #3
Copied table with figures (financial data)

Metrics
Measure
-Match precision
-Match recall
-Time required
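
The first two metrics can be computed as in the following sketch, which uses the standard definitions of precision and recall. The passage identifiers and the function name are invented for illustration and are not results from the presentation.

```python
def precision_recall(reported, relevant):
    """Precision and recall of reported matches against the known copied passages."""
    reported, relevant = set(reported), set(relevant)
    true_positives = len(reported & relevant)
    # Precision: share of reported matches that are genuine.
    precision = true_positives / len(reported) if reported else 0.0
    # Recall: share of genuine copied passages that were reported.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical passage identifiers.
    reported = {"p1", "p2", "p3", "p4"}
    relevant = {"p2", "p3", "p5"}
    print(precision_recall(reported, relevant))
```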

Precision - Recall