An automatic wrapper generation process for large scale crawling of news websites

by

Vassilis Poulopoulos

on 2 October 2014

Transcript of An automatic wrapper generation process for large scale

Palo Services
page-level wrappers

group-of-page validation

frequent regeneration of wrappers

self-discovery

self-healing
Our novel crawling mechanism
Evaluation II
Architecture
An automatic wrapper generation process for large scale crawling of news websites

I. Varlamis (HUA), N. Tsirakis (Palo)
V. Poulopoulos (Palo), P. Tsantilas (Palo)

analyzing thousands of sources daily in many different languages

discovering new sources

collecting hundreds of thousands of articles

ready to enter new markets
Evaluation (..real)
Evaluation
Areas of category links
Article Extraction

"Good news": 94%

"Bad news": 6%


Evaluation III
Palo.gr

Palo.rs
Palo.com.cy

Palo PRO
data δεδομένα Подаци të dhëna veri
1.000.000(!) objects daily in many languages

~100.000/day from sites and blogs

sites, blogs, twitter, facebook, youtube, forums, etc.
...the challenge
easy and automated data collection

not all pages provide channels

new sources of information

current sources frequently changing

improperly structured pages
simulate a human reader
Locate links to category pages
"short" text in link

point in the same domain

relative URLs are usually "short"

positioned in specific areas of the webpage

close to each other

omit common mistakes
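
The category-link cues above can be sketched as a small filter. This is only an illustration under stated assumptions, not the mechanism used in the presented system: the function names, the `(text, href)` input format, and the thresholds (`max_words`, `max_path_len`, `min_run`) are all hypothetical, and "close to each other" is approximated by consecutive position in document order.

```python
from urllib.parse import urljoin, urlparse


def looks_like_category_link(link_text, href, page_url,
                             max_words=3, max_path_len=30):
    """Heuristic check: short link text, same domain, short URL path.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    absolute = urljoin(page_url, href)  # resolves relative URLs
    same_domain = urlparse(absolute).netloc == urlparse(page_url).netloc
    short_text = 0 < len(link_text.split()) <= max_words
    short_path = len(urlparse(absolute).path) <= max_path_len
    return same_domain and short_text and short_path


def category_areas(links, page_url, min_run=3):
    """Group consecutive category-like links into candidate category areas.

    `links` is a list of (link_text, href) pairs in document order.
    """
    areas, run = [], []
    for text, href in links:
        if looks_like_category_link(text, href, page_url):
            run.append((text, href))
        else:
            if len(run) >= min_run:
                areas.append(run)
            run = []
    if len(run) >= min_run:
        areas.append(run)
    return areas
```

Requiring a minimum run of neighbouring candidates is one simple way to discard isolated short links (for example a lone "Contact" link) that would otherwise be mistaken for categories.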

Locating links to article pages
text in link is long

size of URL is usually long

links appear in lists

link text is usually the article title
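
A comparable sketch for article links inverts the category cues: long link text and a long URL. The names and thresholds below (`min_words`, `min_path_len`) are assumptions made for illustration. The "links appear in lists" cue is not modelled here.

```python
from urllib.parse import urljoin, urlparse


def looks_like_article_link(link_text, href, page_url,
                            min_words=4, min_path_len=25):
    """Heuristic check: long link text and a long URL path.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    absolute = urljoin(page_url, href)
    long_text = len(link_text.split()) >= min_words
    long_path = len(urlparse(absolute).path) >= min_path_len
    return long_text and long_path


def extract_article_candidates(links, page_url):
    """Return (provisional_title, absolute_url) pairs for article-like links.

    The link text is kept as a provisional title, since it is usually
    the article title.
    """
    return [
        (text.strip(), urljoin(page_url, href))
        for text, href in links
        if looks_like_article_link(text, href, page_url)
    ]
```

Keeping the link text as a provisional title means the later extraction step can start from a title that has already been acquired.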


Wrapper induction
large article body
title is separate
title and body are "close"
media files in between
common meaning
title already acquired
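
One possible reading of these cues as code: given the title already acquired from the link text, find it on the article page, then take the largest text block near it as the body and keep that block's path as the extraction rule. This is a simplified sketch, not the wrapper induction of the presented system. The `(path, text)` block representation, the `window` parameter, and the function name are hypothetical, and the "common meaning" cue is not modelled.

```python
def induce_body_rule(blocks, known_title, window=5):
    """Return the path of the likely article body, or None.

    `blocks` is a list of (path, text) pairs in document order, where
    `path` is any identifier of the block's position in the page
    (for example a CSS-like selector). Media blocks between title and
    body are skipped implicitly because they carry little or no text.
    """
    title_idx = next(
        (i for i, (_, text) in enumerate(blocks)
         if text.strip() == known_title.strip()),
        None,
    )
    if title_idx is None:
        return None
    # Title and body are "close": look only at the next few blocks.
    nearby = blocks[title_idx + 1: title_idx + 1 + window]
    if not nearby:
        return None
    # Large article body: pick the block with the most text.
    path, _ = max(nearby, key=lambda block: len(block[1]))
    return path
```

A rule induced this way on one page could then be checked against other pages of the same source before being trusted.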

Testing Procedure (categories)
Sources
95 major and minor news portals and blogs in Greek language
Categories
2345 categories

2117 distinct category pages (90%)
Category Areas
At least one category area in each home page

Average number of category areas: 2.6 (!)
Category page analysis
"No article" pages
65% have one (1) empty
Number of articles
90668 links to articles!
Rules for crawling
Rules extracted for all pages

Average number of rules: 7

Empty page rules (0.5%)

Articles with errors
40% of articles with errors have no body

3.6% real errors: pages with high complexity (media and ads in body)
THANK YOU
Vassilis Poulopoulos, PhD

Research & Development Director
pv@paloservices.com
Conclusion / Future Work
Advanced crawler in large scale environments

Enhancements

Self-healing

Self-correction