Sandhan - CLIA system - IIT Kharagpur

Presented by: Prof. Sudeshna Sarkar
By R. Rajendra Prasath, on 16 March 2018


Sandhan - The CLIA system
Prof. Sudeshna Sarkar
Indian Institute of Technology Kharagpur

Welcome to the CLIA workshop @ DA-IICT, Gandhinagar
28 September 2012 – 30 September 2012
CORE Components Developed BY IIT Kharagpur
AU-KBC | AU-CEG
C-DAC PUNE
DA-IICT
GAUHATI
IIIT Bhubaneswar
IIIT Hyderabad
IIT Bombay
IIT Kharagpur
JADAVPUR
ISI Kolkata
Ministry of Communications and Information Technology (MCIT), Government of India
CLIA Consortium
parse-cml
index-cml
ranker
analysis-bn: Bengali Stemmer
seed URLs collector
synonym injector
near duplicate detector
language resources

CMLifier - The Overall Architecture
CMLifier (parse-cml) -> Cleaned Web Content (parse object / parse data / parse text) -> Indexer (index-cml) -> Inverted Index

Ranker: Our Experiments
OPIC score with different boost factors
Web Graph generation & analysis
Ranking with results diversification
Link-to-text ratio heuristic to eliminate the noisy blocks in the web document (a small sketch follows below)
Different HTML tag filtering approaches implemented

The CMLifier output (the parsed content) is the input to the indexer

Approaches used (Current Version):
Duplicate sentence removal at the phrase level and the sentence level
HTML entities conversion (e.g. "&amp;" -> "&")
Meta tags filter improved
Hex Code to Unicode conversion has been improved
Markers added to: title, content, description and keywords

Next Update:
Content with header and paragraph tags
Key phrase extraction from each document
Extracting informative segments with their topicality
Incorporating near duplicate content detection
parse-cml
Indexer
Content Extraction Accuracy - with effective Noise Filtering
parse-cml - The Overall Architecture
Average Content Extraction Speed: 0.012 sec per page
[with NO Font transcoder / Language / Domain Identification]
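
To make the link-to-text ratio heuristic above concrete, here is a minimal, hypothetical sketch (not the actual parse-cml code): it measures the share of a block's visible text that sits inside anchor tags and drops the block when the ratio crosses an assumed threshold.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a link-to-text ratio filter; parse-cml's real
// implementation and threshold are not given in this presentation.
public class LinkToTextRatioFilter {
    private static final Pattern ANCHOR = Pattern.compile("<a\\b[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the ratio of anchor-text length to total visible text length of a block. */
    static double linkToTextRatio(String blockHtml) {
        int anchorChars = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            anchorChars += m.group(1).replaceAll("<[^>]+>", "").length();
        }
        int totalChars = blockHtml.replaceAll("<[^>]+>", "").length();
        return totalChars == 0 ? 1.0 : (double) anchorChars / totalChars;
    }

    public static void main(String[] args) {
        String menuBlock = "<div><a href=\"/a\">Home</a> <a href=\"/b\">Hotels</a> <a href=\"/c\">Contact</a></div>";
        String contentBlock = "<p>Agra is a prominent tourist destination on the banks of the Yamuna. "
                + "See also <a href=\"/taj\">Taj Mahal</a>.</p>";
        double threshold = 0.5;  // assumed cut-off, for illustration only
        System.out.println("menu block ratio    = " + linkToTextRatio(menuBlock)
                + " -> " + (linkToTextRatio(menuBlock) < threshold ? "keep" : "drop"));
        System.out.println("content block ratio = " + linkToTextRatio(contentBlock)
                + " -> " + (linkToTextRatio(contentBlock) < threshold ? "keep" : "drop"));
    }
}
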
Duplicate / Near Duplicate Detection

Bengali Stemmer - plugin: analysis-bn
Current: list lookup (noun root, noun suffix, verb root, verb suffix); stemming accuracy: 80.58%
Next update: TRIE based Bengali Stemmer

Current: customized way of setting the Similarity Function
Next attempt: Segmented Text Importance Computation (STIC) scoring

Seed URLs Collection
Current: automatic way of collecting seed URLs; used entities and patterns to generate user information needs
Language: Bengali; domain: tourism seed URLs
Statistics: 33,769 Bengali seeds collected
Synonym Engine:
The Synonym Engine creates a common entry in the Inverted Index for all variant spellings of the same word
The list of spelling variations is maintained in XML format
During searching, variant spellings are mapped to the normalized entry in the inverted index

Language Resources of IITKGP
Generated:
30 Bengali Queries [81-110] for CLIA testing
Seed URLs collection: 33,769 Bengali, 1,445 English
Synonym list: 4,000 NEs with an average of 4 spelling variations per NE
Test data generated to build a Bengali classifier whose accuracy is 85.42%
Number of Bengali documents collected: 900 (domains: Tourism, General)
Named Entity annotation completed on a 100,000-word Bengali corpus
Named Entity transliteration of almost 8,000 Bengali words
1,300 Multiword Expressions relevant to Tourism
700 query templates created to understand users' intentions
The Bengali Analyzer re-written to suit Lucene 3.x
Identification of Topics and Subtopics is in progress
parse-cml - Flow Chart
Content Extraction - (parse-cml): Web Page -> Extracted Cleaned Content

index-cml
The CMLifier output (the parsed content) is the input to the indexer
The Indexer builds a document object with the following fields:
url, title, content, meta keywords, domain, lang, site, NER, MWE, meta description, host, digest, etc.
NEs and MWEs are identified during indexing and the identified NEs and MWEs are added to the fields NER and MWE.
Each field is configured with one of two options:
indexed (searchable): url, title, content, domain, lang, NER, MWE, meta description
stored (store): url, title, domain, lang, meta keywords, meta description
The created fields are combined into a single Document object, which is written to the index.
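
The indexed/stored split above maps directly onto the Lucene 3.x Field API that index-cml targets. The following is only a minimal sketch with a few of the fields listed above and assumed store/index choices, not the actual plugin code (it assumes the Lucene 3.x jar on the classpath).

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Minimal sketch of building a document object with indexed/stored fields,
// in the style of Lucene 3.x which index-cml targets; not the actual plugin code.
public class DocumentBuilderSketch {
    public static Document build(String url, String title, String content,
                                 String lang, String domain) {
        Document doc = new Document();
        // stored and searchable fields
        doc.add(new Field("url",    url,    Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title",  title,  Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("domain", domain, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("lang",   lang,   Field.Store.YES, Field.Index.NOT_ANALYZED));
        // searchable but not stored
        doc.add(new Field("content", content, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = build("http://www.tdil-dc.in/sandhan", "Sandhan",
                "Sandhan is the CLIA search system", "bn", "tourism");
        System.out.println(doc);
    }
}
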

index-cml
Current:
Fields and their options like store / index are pre-determined

Next Update:
index-cml will be rewritten to handle configurable fields, so as to support higher versions of Lucene

Future Plans:
Ranker plugin: scoring-stic
New Query Expansion for Bengali Retrieval
Cross Lingual Information Retrieval: Bengali - Hindi, Bengali - English
Incorporating Pseudo Relevance Feedback (PRF)
TRIE based Bengali Stemmer
Incorporating Topics/Subtopics into document scoring
Creation of Language Vertical Resources

Ranking with Sandhan
Ranking Experiments with Sandhan by IITKGP
System 1: The Base (current) Sandhan system
System 2: Web-graph with link rank (or PageRank) algorithm for Bengali documents
Overall, applying link rank increases retrieval effectiveness
System 3: Sandhan system with near duplicate detection algorithm
Diverse results are presented in the top 10
System 4: Combines the benefits of both System 2 and System 3
Additionally, we have altered boosting in the core indexer module, indexed in-linked anchor texts for a document, added content richness based document boosting, and added a URL decoder to collect and index words from ill-formed URLs

We use the following features to award each document a score:
a. OPIC Score (the scoring strategy in the current Sandhan system)
b. LINK Score (score obtained for each document by applying link ranking)
c. CONTENT Score (score obtained based on the document's content richness)
Crawl Name : bn-NEs2Urls-All (frozen)
Crawl Size : 1,18,159 Total Docs [77,986 Bengali Docs]
Fields Boost : NEs(m), NEs in Content(m),
m = 4.0, 8.0, 11.0, 14.0, 20.0 (weights) & tourism(10.0-Fixed)
Queries Range : 81 - 110 (in Bengali)

        m = 5.0   m = 8.0   m = 11.0   m = 14.0   m = 20.0
P@5     0.66      0.52      0.56       0.55       0.52
P@10    0.53      0.45      0.46       0.46       0.45
Expr – 1: Sandhan with different Boost factors to NEs
Crawl Name : bncrawl-26122011 (frozen)
Crawl Size : 1,75,778 Total Docs [1,32,204 Bengali Docs]
Fields Boost : NEs (12.0), MWEs(12.0), NEs and MWEs - each in Content(12.0) (weights) & Domain = tourism (10.0)
m = 4.0, 8.0, 11.0, 14.0, 20.0 (weights) & tourism(10.0-Fixed)
Queries Range : 81 - 110 (in Bengali)

        p@5    p@10   p@15   p@20   p@25   p@30
avg     0.55   0.46   0.37   0.32   0.28   0.25
Expr – 2: Sandhan with Local Crawl
Crawl Name : bncrawl-26122011 (frozen)
Crawl Size : 1,75,778 Total Docs [1,32,204 Bengali Docs]
Fields Boost : NEs (12.0), MWEs(12.0), NEs and MWEs-each in Content(12.0), (weights) NEs-in Title(12.0), NEs-in Url(14.0) and Domain = tourism (10.0)
Queries Range : 81 - 110 (in Bengali)

        p@5    p@10   p@15   p@20   p@25   p@30
avg     0.55   0.45   0.38   0.32   0.29   0.26
Expr – 3: Sandhan with Local Crawl
Performance Comparison of 4 rankings
Bengali Monolingual Retrieval
- Sandhan Performance – Comparison* with Google
Sandhan1: as tested on 17 February 2012
Sandhan2: as tested on 21 February 2012, depth 3, IITB system
CML - An Example
Participating Institutions
C-DAC NOIDA
Visit at:
http://www.tdil-dc.in/sandhan

Funded By:
Crawl Name : bncrawl-26122011 (frozen)
Crawl Size : 1,75,778 Total Docs [1,32,204 Bengali Docs]

Fields Boost : NEs (m), MWEs(m), NEs and MWEs - each in Content(m) (weights) & Domain = tourism (10.0),
where m = 4.0, 8.0, 11.0, 14.0, 20.0 (weights)
and
NEs-in Title (12.0), NEs-in Url (14.0)

Queries Range : 81 - 110 (in Bengali)

        p@5    p@10   p@15   p@20   p@25   p@30
avg     0.55   0.45   0.38   0.32   0.29   0.26
Expr – 3: Sandhan with Local Crawl
Crawl Name : bn-NEs2Urls-All (frozen)
Crawl Size : 1,18,159 Total Docs [77,986 Bengali Docs]

Fields Boost : NEs(m), NEs in Content(m),
m = 1.0, 4.0, 8.0, 11.0, 14.0, 20.0 (weights)
& domain = tourism(10.0-Fixed)

Queries Range : 81 - 110 (in Bengali)

        m = 1.0   m = 5.0   m = 8.0   m = 11.0   m = 14.0   m = 20.0
P@5     0.42      0.66      0.52      0.56       0.55       0.52
P@10    0.36      0.53      0.45      0.46       0.46       0.45
Expr – 1: Sandhan with different Boost factors to NEs (With Local Crawl)
Step – 1: Considered features to find a Focused Entity
Presence of an entity in the fields: title, meta description, url, in-linked anchors
Number of sentences in which the entity occurs
tf of the entity in the document
idf of the entity
Step – 2: Measure the support for the Focused Entity
F_SCORE: fraction of sentences in which the entity occurs
PR_SCORE: occurrence of the entity in the top ranked sentences
POS_SCORE: occurrence of an entity towards the sentence beginning
contentScore = F_SCORE + PR_SCORE + POS_SCORE
Focused Entity Identification & Scoring
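
A minimal sketch of the contentScore above, under assumed definitions of the three components (fraction of sentences containing the entity, presence in the first few "top ranked" sentences, occurrences near sentence beginnings); the exact definitions and weights used in the ranker are not given in the slides.

import java.util.Arrays;
import java.util.List;

// Sketch of contentScore = F_SCORE + PR_SCORE + POS_SCORE for one focused entity.
// The component definitions below are assumptions for illustration.
public class FocusedEntityScore {
    static double contentScore(String text, String entity) {
        List<String> sentences = Arrays.asList(text.split("[.?!…]\\s*"));
        int topK = Math.min(3, sentences.size());           // assumed "top ranked" = first 3 sentences
        int containing = 0, nearBeginning = 0, inTop = 0;
        for (int i = 0; i < sentences.size(); i++) {
            String s = sentences.get(i);
            int pos = s.indexOf(entity);
            if (pos >= 0) {
                containing++;
                if (i < topK) inTop++;
                if (pos < s.length() / 3) nearBeginning++;  // "towards the sentence beginning"
            }
        }
        if (sentences.isEmpty()) return 0.0;
        double fScore   = (double) containing    / sentences.size();
        double prScore  = (double) inTop         / topK;
        double posScore = containing == 0 ? 0.0 : (double) nearBeginning / containing;
        return fScore + prScore + posScore;
    }

    public static void main(String[] args) {
        String doc = "Darjeeling is a hill station. Tourists visit Darjeeling for the toy train. "
                   + "The weather is pleasant. Darjeeling tea is famous.";
        System.out.println("contentScore = " + contentScore(doc, "Darjeeling"));
    }
}
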
How to define Content Richness?
A document d may have specific information about certain entities (currently place names)
These entities may be present across different segments of the document content.
Segments may be paragraphs, sentences, etc.
These entities may be posed as probable query terms by users
The importance of an entity with respect to the document d may be an informative feature for document scoring
So our aim is to define a scoring strategy that estimates the amount of supporting information, at the segment level, for such entities in the document content.
Finally, this score is used as a measure of the richness of the document content
Content Richness Computation
System 2: Web-graph with link ranking
Performs document similarity based on the Lucene practical scoring formula:

score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) )
Where
tf = term frequency - a measure of how often a term appears in a document
idf = inverse document frequency - a measure of how rare the term is across the index
coord = number of terms in the query that were found in the document
lengthNorm = measure of the importance of a term according to the total number of terms in the field
queryNorm = normalization factor so that queries can be compared
boost (index) = boost of the field at index time
boost (query) = boost of the field at query time
System 1: The Base(current) Sandhan system
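
For illustration, the formula can be evaluated for a single-term query with Lucene's DefaultSimilarity definitions (tf = sqrt(freq), idf = ln(numDocs / (docFreq + 1)) + 1, norm = 1 / sqrt(fieldLength)); the statistics and the boost value below are made-up numbers, not Sandhan's.

// Sketch of Lucene's practical scoring formula for a single-term query,
// using the DefaultSimilarity definitions. The numbers below are illustrative only.
public class LuceneScoreSketch {
    public static void main(String[] args) {
        int numDocs = 132204;      // e.g. Bengali docs in a crawl
        int docFreq = 850;         // documents containing the term
        int freqInDoc = 4;         // term frequency in this document
        int fieldLength = 400;     // number of terms in the field
        double fieldBoost = 12.0;  // index/query-time boost, as used for NER/content fields

        double tf = Math.sqrt(freqInDoc);
        double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        double norm = 1.0 / Math.sqrt(fieldLength);
        double coord = 1.0;                          // all query terms matched
        double queryNorm = 1.0 / Math.sqrt(idf * idf * fieldBoost * fieldBoost);

        double score = coord * queryNorm * (tf * idf * idf * fieldBoost * norm);
        System.out.println("score(q, d) = " + score);
    }
}
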
Current:
List lookup
Resources used:
noun root, noun suffix, verb root, verb suffix

Accuracy : 80.58%
Speed : 11 seconds to stem 14,807 unique terms

Latest Update sent on: 23 May 2012 14:22 IST
analysis-bn: Bengali Stemmer
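
A minimal sketch of list-lookup stemming, assuming tiny in-memory root and suffix lists in place of the analysis-bn noun/verb resources: strip the longest matching suffix and accept the result only when the remainder is a known root.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of list-lookup stemming: longest-suffix stripping validated against a root list.
// The tiny Bengali root/suffix lists here are placeholders for the analysis-bn resources.
public class ListLookupStemmer {
    private final Set<String> roots;
    private final List<String> suffixes;   // should be sorted longest-first

    ListLookupStemmer(Set<String> roots, List<String> suffixes) {
        this.roots = roots;
        this.suffixes = suffixes;
    }

    String stem(String word) {
        if (roots.contains(word)) return word;
        for (String suffix : suffixes) {
            if (word.endsWith(suffix)) {
                String candidate = word.substring(0, word.length() - suffix.length());
                if (roots.contains(candidate)) return candidate;   // root + suffix lookup succeeded
            }
        }
        return word;    // unknown word: leave unstemmed
    }

    public static void main(String[] args) {
        Set<String> roots = new HashSet<>(Arrays.asList("বাড়ি", "কলকাতা"));
        List<String> suffixes = Arrays.asList("গুলোতে", "গুলো", "তে", "র");
        ListLookupStemmer stemmer = new ListLookupStemmer(roots, suffixes);
        System.out.println(stemmer.stem("বাড়িতে"));      // -> বাড়ি
        System.out.println(stemmer.stem("কলকাতার"));     // -> কলকাতা
    }
}
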
Core Content Filtering module:
CMLifier (plugin: parse-cml)
cleans web pages by identifying and removing noisy segments
extracts the cleaned text content of the web documents
Parsing
Ranker module in pluggable format
Cross Lingual Information Retrieval
Bengali – Hindi
Bengali - English
TRIE based Bengali Stemmer
Creation of language specific resources in Bengali

Near duplicate documents detection with map-reduce
Incorporating Pseudo Relevance Feedback (PRF)
Incorporating topics / subtopics into document scoring
New Query Expansion: Incorporating Clustering By Direction (CBD) algorithm to expand Bengali queries
Future Attempts
Crawl Name : bncrawl-26122011 (frozen)
Crawl Size : 1,75,778 Total Docs [1,32,204 Bengali Docs]

Fields Boost : NEs (m), MWEs(m), NEs and MWEs - each in Content(m) (weights) & Domain = tourism (10.0)
where m =1.0, 4.0, 8.0, 11.0, 14.0, 20.0 (weights)

Queries Range: 81 - 110 (in Bengali)

        p@5    p@10   p@15   p@20   p@25   p@30
avg     0.55   0.46   0.37   0.32   0.28   0.25
Expr – 2: Sandhan with Local Crawl
Steps for Improvement
We collected 3,000 Named Entities of popular places in India
We created a list of 1,300 Multiword Expressions relevant to tourism.
A spelling variation list was created with 1,000 Named Entities in Bengali and an average of 4 spelling variations per word
For each named entity in the NE list, we used other search engines (Google, Yahoo!, Rediff) and collected relevant URLs (a total of 33,769 URLs)
Bengali Stemmer accuracy was improved by adding new words to the root lists and incorporating additional rules
Bengali Monolingual search
System 1: Basic Sandhan
System 2: Basic Sandhan with Link Rank
System 3: Basic Sandhan with Link Rank & near dedup
System 4: Basic Sandhan with Link Rank, near dedup and content, url and in-link anchor text scoring
Summary: p@d (d=5,10) of Systems: 1 – 4
System 1: Basic Sandhan
System 2: Basic Sandhan with Link Rank
System 3: Basic Sandhan with Link Rank & near dedup
System 4: Basic Sandhan with Link Rank, near dedup and content scoring
Comparison: p@5 of Systems 1 – 4
Ranking Experiments: The Overall Comparison
We have combined the benefits of link ranking and near duplicate detection with boost factors of content, url and in-linked anchor texts. For this, we made the following changes:

Altered certain boosting in the core indexer module
Indexed in-linked anchor texts for a document
Content Richness based document boosting
Added a URL decoder to index terms from ill-formed URLs

Additionally, we used the following features to award each document a score:
OPIC Score (current scoring strategy of Sandhan)
LINK Score (Score obtained by applying link ranking)
CONTENT Score (Score based on content richness)
System 4: Sandhan with link ranking, near dedup algorithm & content, in-link scoring
Document Reputation:
The goodness of a document can be estimated as follows:

Inlink Score = sqrt(number of extra-domain links) + log(number of intra-domain links) + number of anchor texts containing the focused entity
URL Score = (OPIC Score + Link Score) * subjective site reputation
Overall Document Score = Inlink Score + URL Score, given as the document boost

Here, subjective site reputation is based on certain preferences, like boosting government websites higher
Content Richness Computation (contd…)
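
The reputation formulas above transcribe almost directly into code; the subjective site reputation value and the sample link counts below are illustrative assumptions.

// Sketch of the document reputation boost described above:
//   Inlink Score = sqrt(#extra-domain links) + log(#intra-domain links) + #anchor texts with the focused entity
//   URL Score    = (OPIC score + link score) * subjective site reputation
//   Doc Score    = Inlink Score + URL Score   (used as the document boost)
public class DocumentReputation {
    static double inlinkScore(int extraDomainLinks, int intraDomainLinks, int anchorsWithFocusedEntity) {
        double intra = intraDomainLinks > 0 ? Math.log(intraDomainLinks) : 0.0;
        return Math.sqrt(extraDomainLinks) + intra + anchorsWithFocusedEntity;
    }

    static double urlScore(double opicScore, double linkScore, double siteReputation) {
        return (opicScore + linkScore) * siteReputation;
    }

    public static void main(String[] args) {
        double siteReputation = 1.5;   // assumed preference, e.g. boosting government sites higher
        double inlink = inlinkScore(49, 120, 3);
        double url = urlScore(0.8, 2.1, siteReputation);
        System.out.println("document boost = " + (inlink + url));
    }
}
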
The current Sandhan system finds EXACT duplicates, but does not provide facilities to detect near duplicate documents
We developed a near duplicate detection algorithm and eliminated near duplicate documents from the index
Currently there is no evaluation strategy for duplicated results. For a query, if a top ranked document gets duplicated n times, then all these n duplicate copies of the document may also be ranked / clustered together with the original one. This inflates p@k.
It is essential to identify & remove near duplicate docs
Removal of near duplicates may result in a drop of precision below the precision of System 2. However, this increases the diversity of results among the top 10 documents.
System 3: Sandhan with near duplicate detection algorithm
System 1: The Base (current) Sandhan system
System 2: Web-graph with link rank algorithm
System 3: Sandhan system with near duplicate detection algorithm
System 4: Sandhan with link ranking, near dedup algorithm & content, in-link anchor text scoring
Ranking with Sandhan – Our Attempts
Current:

OPIC score with different boost factors

Web Graph generation & analysis (nutch-0.9 needs upgrade to ver 1.1 or above)

Ranking with results diversification
Ranker Experiments by IITKGP
Points:

Presently, Near Duplicate Detection has to be performed as a separate job, like the parser and the indexer

It is implemented in a subsystem which sequentially performs near duplicate document detection and elimination

To make this option scalable, it is necessary to rewrite this subsystem with the Map-Reduce framework
Near Duplicate Detection – Overheads
Algorithm: Step - II
Near Duplicate Detection and Deletion

For each document di in Master Index do:
    If marked_for_delete(di) = true then SKIP to next document
    Else
        current_doc = di;
        hashset_currentDoc = getHashset(current_doc);
        B = Construct_Boolean_Query(hashset_currentDoc);
        Candidate_documentSet = queryInvertedHashIndex(B);
        For TOP k docs dki in Candidate_documentSet do:
            candidate_doc = dki;
            hashset_candidateDoc = getHashset(candidate_doc);
            overlap_coefficient = CalculateOverlapCoefficient(hashset_currentDoc, hashset_candidateDoc);
            If overlap_coefficient > threshold then
                add candidate_doc to marked_nearDuplicateSet
            End If
        End For
        deleteBasedOnPolicy(marked_nearDuplicateSet UNION current_doc)
    End If
End For
 
Algorithm: Step - I
Hash Generation

For each document di in D that has been Crawled and Parsed do:
    tokenSet = getTopNTokens(di, N);
    ngramSet = createNGram(tokenSet);
    hashSet = GenerateHashes(ngramSet);
    buildInvertedHashIndex(hashSet, di);
End For
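
The two steps above rest on hashing token n-grams and comparing hash sets with an overlap coefficient. The sketch below shows just those primitives for a pair of documents (n-gram size and threshold are assumptions, and the real module consults the inverted hash index rather than comparing documents pairwise).

import java.util.HashSet;
import java.util.Set;

// Sketch of Step I/II primitives: hash the word trigrams of a document and compare
// two documents with the overlap coefficient |A ∩ B| / min(|A|, |B|).
// The n-gram size and the threshold are assumptions for illustration.
public class NearDuplicateSketch {
    static Set<Integer> ngramHashes(String text, int n) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Set<Integer> hashes = new HashSet<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            StringBuilder gram = new StringBuilder();
            for (int j = 0; j < n; j++) gram.append(tokens[i + j]).append(' ');
            hashes.add(gram.toString().hashCode());
        }
        return hashes;
    }

    static double overlapCoefficient(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        String d1 = "Agra is a prominent tourist destination on the banks of the river Yamuna";
        String d2 = "Agra is a prominent tourist destination situated on the banks of river Yamuna";
        double overlap = overlapCoefficient(ngramHashes(d1, 3), ngramHashes(d2, 3));
        System.out.println("overlap = " + overlap + (overlap > 0.6 ? " -> near duplicate" : " -> distinct"));
    }
}
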
Near Duplicate Detection - Illustration
We use an XML file to maintain the list of spelling variations

The Synonym Engine parses the XML file containing the spelling variations and creates a common entry in the Inverted Index for all the variant spellings of the same word

During searching, variant spellings are mapped to the normalized entry in the index

Current status:
Synonym List: generated 4,000+ Named Entities with an average of 4 spelling variations per Named Entity in the Bengali language
Synonym Injector
Experiment – 1: Accuracy: 80.58%
# Terms used from Bengali Web docs: 2,070 unique terms
Observation: NEs are getting stemmed correctly in most cases and only suffix stripping takes place in some cases.

Experiment - 2 : Accuracy: 75.21%
# Terms used from Bengali FIRE corpus: 8,146 unique terms
Bengali Stemmer Performance
Index-cml uses two subsystems during indexing

Based on the language of the content, the corresponding language analyzer is invoked

Analysis-XX [ XX = “bn” for Bengali language]
Synonym Injector [presently provided only for Bengali]
Indexing Phase
URL decoder will be updated in parse-cml:

Example: Currently, Sandhan does not handle the decoding of URLs in UTF-8. The actual URL of the Wikimapia page on “হাওড়া-ব্রিজ-রবীন্দ্র-সেতু” looks as follows:

http://wikimapia.org/7033110/bn/%E0%A6%B9%E0%A6%BE%E0%A6%93%E0%A7%9C%E0%A6%BE-%E0%A6%AC%E0%A7%8D%E0%A6%B0%E0%A6%BF%E0%A6%9C-%E0%A6%B0%E0%A6%AC%E0%A7%80%E0%A6%A8%E0%A7%8D%E0%A6%A6%E0%A7%8D%E0%A6%B0-%E0%A6%B8%E0%A7%87%E0%A6%A4%E0%A7%81

After URL decoding, the following output should be added to the index:
http://wikimapia.org/7033110/bn/হাওড়া-ব্রিজ-রবীন্দ্র-সেতু

Note: The decoded URL generates the tokens হাওড়া, ব্রিজ, রবীন্দ্র, সেতু, which are searchable, whereas the first URL does not generate these tokens
parse-cml: What’s Next?
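
Percent-decoding such URLs is available in the standard Java library, so the planned update can be sketched as below; splitting the decoded path on '-' to obtain indexable tokens is an assumption for illustration.

import java.net.URLDecoder;

// Sketch of the planned URL decoding step: percent-encoded UTF-8 URLs are decoded
// before tokenization so that Bengali words in the path become searchable.
public class UrlDecoderSketch {
    public static void main(String[] args) throws Exception {
        String raw = "http://wikimapia.org/7033110/bn/%E0%A6%B9%E0%A6%BE%E0%A6%93%E0%A7%9C%E0%A6%BE"
                   + "-%E0%A6%AC%E0%A7%8D%E0%A6%B0%E0%A6%BF%E0%A6%9C";
        String decoded = URLDecoder.decode(raw, "UTF-8");
        System.out.println(decoded);                      // …/bn/হাওড়া-ব্রিজ
        // Split the decoded path into tokens that can be added to the url field of the index.
        for (String token : decoded.substring(decoded.lastIndexOf('/') + 1).split("-")) {
            System.out.println(token);
        }
    }
}
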
Current:
We have collected tourism specific seed URLs in Bengali using popular search engines, blogs, forums and travel sites
Entities and their associated query patterns related to the tourism domain are used to generate user information needs:
For example (in English):
For the entity “Darjeeling”, we have generated query patterns like “how to reach Darjeeling”, “cheap accommodation in Darjeeling”, “places to see in Darjeeling”, etc.

Statistics:
33,769 Bengali seeds collected
Type: Tourism related seed URLs
Seed URLs Collection
Resources Developed in Bengali
Language Horizontal Tasks - Core Modules:
Parsing (parse-cml plugin)
Indexing (index-cml plugin)
Ranking
Language Vertical Tasks - Additional subsystems:
Language Resources Developed in Bengali
Language Analyzer: analysis-bn (Bengali Analyzer)
Bengali Stemmer
Synonym Injector
Near Duplicate Detection and Elimination
CORE Components - By IIT Kharagpur
S.No   Crawl                          # Documents   # Bengali Docs
-------------------------------------------------------------------
1.     bn-NEs2Urls-All (frozen)       1,18,159      77,986
2.     bnCrawl-nes2urls-extn-D2       4,20,714      3,22,390
3.     bncrawl-26122011 (Frozen)*     1,75,778      1,32,204
4.     bnCrawl-FrozenSeeds-D3         18,19,097     8,12,013
[Latest Crawl – additionally crawled to depth 3]

Details of RJs done with bncrawl-26122011 (Frozen)*, 81 – 110 Bengali Queries:

* Sanity Check:
Summary
Experiment with different boost factors for NEs
Queries : 81 – 110 Bengali Queries
Crawl : 77,986 Bengali Documents
Fields Boosted : URL(3), title(3), content(10), domain(10), anchor(1); different boost values for NE
Experiments on ranking
(With Local Crawl) by varying different boost values
Sandhan1: as tested on 17 February 2012
Sandhan2: as tested on 21 February 2012, depth 3 crawl IITB system
Sandhan Performance – Comparison* with Google
Final Results – Bengali Monolingual Retrieval
System 1: Basic Sandhan
System 2: Basic Sandhan with Link Rank
System 3: Basic Sandhan with Link Rank & near dedup
System 4: Basic Sandhan with Link Rank, near dedup and content scoring
Comparison: p@10 of Systems 1 – 4
System 4: Sandhan with link ranking, near dedup algorithm & content, in-link scoring
Evaluator Feedback:
9 documents, including news and blogs, are found in the top 10 search results, with good pictures and tourism related information such as pilgrimages, tour plans, transport and attractions in Leh Ladakh.
Query: লেহ লাদাখ
For more details:
http://pr.efactory.de/
Similar to Google's PageRank algorithm
A measure of the relative importance of pages across the world wide web
What is link ranking?
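
For illustration, a tiny power-iteration sketch of PageRank-style link ranking over a toy web graph; the damping factor and iteration count are conventional defaults, not values from the Sandhan experiments.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Tiny power-iteration sketch of PageRank-style link ranking on a small web graph.
// Damping factor 0.85 and 20 iterations are conventional defaults, not Sandhan settings.
public class LinkRankSketch {
    public static void main(String[] args) {
        Map<String, List<String>> outlinks = new HashMap<>();
        outlinks.put("A", Arrays.asList("B", "C"));
        outlinks.put("B", Arrays.asList("C"));
        outlinks.put("C", Arrays.asList("A"));
        outlinks.put("D", Arrays.asList("C"));

        double d = 0.85;
        int n = outlinks.size();
        Map<String, Double> rank = new HashMap<>();
        for (String page : outlinks.keySet()) rank.put(page, 1.0 / n);

        for (int iter = 0; iter < 20; iter++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : outlinks.keySet()) next.put(page, (1 - d) / n);
            for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
                double share = rank.get(e.getKey()) / e.getValue().size();
                for (String target : e.getValue()) {
                    next.put(target, next.get(target) + d * share);
                }
            }
            rank = next;
        }
        rank.forEach((page, score) -> System.out.println(page + " -> " + score));
    }
}
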
Total Execution Time:

Nearly 6 hours
Statistics: Corpus: bnCrawl-D2-deDup (# documents)
Size of the Corpus : 4,80,681
Number of Duplicate Documents detected : 2,92,548
Number of Unique Documents (no duplicates) : 1,44,393
Number of Docs having at least one duplicate : 43,740
Number of Documents having ≥ 500 duplicates : 19
Near Duplicate Detection - Experimental Results
Nutch handles EXACT duplicates, but NOT near duplicates
NEAR DUPLICATES: the task of identifying and organizing documents that are “nearly identical” to each other … that is, the content of one web page is identical to that of another except for a few characters
Our Near Duplicate Detection module identifies and removes near duplicate content from the index
Near Duplicate Detection
Key phrases extraction from Single Document* (contd…)

5. Calculate χ'² value:
For each term w, freq(w, g) = co-occurrence frequency of w with g ∈ C
n_w = total number of terms in the sentences including w
Calculate the χ'² value using the equation:

χ'²(w) = Σ_{g ∈ C} (freq(w, g) − n_w·p_g)² / (n_w·p_g)  −  max_{g ∈ C} (freq(w, g) − n_w·p_g)² / (n_w·p_g)

Where
n_w is the total number of terms in sentences where w appears, and
p_g is (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document)

6. Output: keywords - the top m (= 20) terms having the largest χ'² value
parse-cml: What’s Next?
Y. Matsuo, M. Ishizuka, Keyword extraction from a single document using word co-occurrence statistical information, International Journal on Artificial Intelligence Tools, 13(1): 157-169 (2004)
Key phrases extraction from Single Document* - Algorithm

1. Preprocess:
Stem words by the Porter algorithm (Porter 1980)
Extract phrases based on n-grams [here we assumed max(n) = 3]
Remove stop words using the SMART stop word list (Salton 1988)
2. Select frequent terms:
N_total = up to 30% of the top frequent terms
3. Cluster frequent terms:
Cluster a pair of terms using Jensen-Shannon divergence (> threshold = 0.95 × log 2)
Cluster a pair of terms using mutual information (> threshold = log(2.0))
C <- the set of clusters obtained
4. Calculate the expected probability:
n_g = number of terms co-occurring with g ∈ C
The expected probability p_g = n_g / N_total
parse-cml: What’s Next?
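
Steps 4–6 reduce to counting sentence-level co-occurrences and computing the χ'² statistic. The sketch below assumes the frequent-term clusters C are already given (step 3) and uses plain whitespace tokenization in place of the Porter/SMART preprocessing of step 1.

import java.util.*;

// Sketch of steps 4-6: given clusters C of frequent terms, score every term w by
// chi'^2(w) = sum_g (freq(w,g) - n_w*p_g)^2 / (n_w*p_g) minus the largest single summand.
// Preprocessing (stemming, stop words, n-gram phrases) from steps 1-3 is omitted.
public class ChiSquareKeywords {
    public static void main(String[] args) {
        String[] sentences = {
            "darjeeling toy train runs from siliguri",
            "the toy train is the main attraction of darjeeling",
            "hotels in darjeeling fill up in summer",
            "summer is the best time to visit darjeeling"
        };
        // Assumed clusters of frequent terms (step 3 output):
        List<Set<String>> clusters = Arrays.<Set<String>>asList(
            new HashSet<>(Arrays.asList("darjeeling")),
            new HashSet<>(Arrays.asList("toy", "train")),
            new HashSet<>(Arrays.asList("summer")));

        int totalTerms = 0;
        List<String[]> tokenized = new ArrayList<>();
        for (String s : sentences) { String[] t = s.split("\\s+"); tokenized.add(t); totalTerms += t.length; }

        // p_g = (total terms in sentences containing g) / (total terms in the document)
        double[] pg = new double[clusters.size()];
        for (int g = 0; g < clusters.size(); g++) {
            int terms = 0;
            for (String[] t : tokenized) if (containsAny(t, clusters.get(g))) terms += t.length;
            pg[g] = (double) terms / totalTerms;
        }

        Map<String, Double> chi = new HashMap<>();
        Set<String> vocab = new HashSet<>();
        for (String[] t : tokenized) vocab.addAll(Arrays.asList(t));
        for (String w : vocab) {
            int nw = 0;                              // total terms in sentences containing w
            int[] freq = new int[clusters.size()];   // sentence-level co-occurrence counts freq(w, g)
            for (String[] t : tokenized) {
                if (!Arrays.asList(t).contains(w)) continue;
                nw += t.length;
                for (int g = 0; g < clusters.size(); g++) if (containsAny(t, clusters.get(g))) freq[g]++;
            }
            double sum = 0, max = 0;
            for (int g = 0; g < clusters.size(); g++) {
                double expected = nw * pg[g];
                if (expected == 0) continue;
                double term = Math.pow(freq[g] - expected, 2) / expected;
                sum += term;
                max = Math.max(max, term);
            }
            chi.put(w, sum - max);                   // chi'^2: robust variant, drop the largest term
        }
        chi.entrySet().stream()
           .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
           .limit(5)
           .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }

    static boolean containsAny(String[] tokens, Set<String> cluster) {
        for (String t : tokens) if (cluster.contains(t)) return true;
        return false;
    }
}
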
Content Extraction Speed: 0.012 sec per page
[with NO Font transcoder / Language / Domain Identification]
parse-cml - Experimental Results
HTML to CML conversion: An Example
Evaluator Feedback:
Most of the documents, including news and blogs, found in the top 10 search results are diverse. These documents contain useful information on tourism, tours, pilgrimages, transport & attractions in Leh Ladakh.
Query: লেহ লাদাখ
System 3: Sandhan with near duplicate detection algorithm – Experimental Results

Evaluator Feedback:
Only one document, a Wikipedia page, is found; it describes the district of Ladakh, mentioning Leh as its largest city, and includes a map and an image.
The rest of the documents are all duplicates.
Query: লেহ লাদাখ
System 2: Web-graph with link ranking - Experimental Results

Evaluator Feedback:
4 documents out of the last 5 results in the top 10 contain illustrative pictures and information on tourism, with special focus on tour plans, pilgrimages, transport and attractions in Leh Ladakh. The rest of the retrieved pages contain hyperlinks leading to Leh and Ladakh.
Query: লেহ লাদাখ
System 1: The Base (current) Sandhan system - Experimental Results
Actual Query: লেহ লাদাখ

Expanded Lucene Query:
+(NER:লেহ^12.0 content:লেহ^12.0 title:লেহ^12.0 host:লেহ^12.0 url:লেহ^12.0 anchor:লেহ^12.0 NER:লাদাখ^12.0 content:লাদাখ^12.0 title:লাদাখ^12.0 host:লাদাখ^12.0 url:লাদাখ^12.0 anchor:লাদাখ^12.0) +lang:bn domain:tourism
url:"লেহ লাদাখ"~2147483647^3.0 anchor:"লেহ লাদাখ"~4
content:"লেহ লাদাখ"~2147483647^10.0
title:"লেহ লাদাখ"~2147483647^3.0 host:"লেহ লাদাখ"~2147483647^2.0


Total Hits: 3,30,108 documents
Corpus: bnCrawl-FrozenSeeds-D3
Document Scoring in Nutch - An example
Details
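
The expanded query above is an ordinary boolean query with per-field boosts; a minimal Lucene 3.x-style sketch of how one query term is expanded into boosted per-field clauses, plus the language and domain filters (not the actual Nutch query translation), looks like this:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Sketch of how a user query term is expanded into boosted per-field clauses,
// in the style of Lucene 3.x; boosts follow the expanded query shown above.
public class ExpandedQuerySketch {
    public static void main(String[] args) {
        String term = "লেহ";
        String[] fields = {"NER", "content", "title", "host", "url", "anchor"};
        BooleanQuery perTerm = new BooleanQuery();
        for (String field : fields) {
            TermQuery clause = new TermQuery(new Term(field, term));
            clause.setBoost(12.0f);                       // field boost from the expanded query
            perTerm.add(clause, BooleanClause.Occur.SHOULD);
        }
        BooleanQuery full = new BooleanQuery();
        full.add(perTerm, BooleanClause.Occur.MUST);      // the '+' in front of the clause group
        full.add(new TermQuery(new Term("lang", "bn")), BooleanClause.Occur.MUST);
        full.add(new TermQuery(new Term("domain", "tourism")), BooleanClause.Occur.SHOULD);
        System.out.println(full);
    }
}
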

Plans to add in the next update:

Content with header and paragraph tags

URL decoder will be updated in parse-cml

Key phrases extraction from each document

Extracting informative segments with their topicality
parse-cml: What’s Next?
Issues: identified errors in eliminating segments containing forms, news feeds, copyright info, short messages, and menu items, for which the link-to-text ratio fails

Content Extraction Accuracy = (number of segments correctly extracted) / (total number of segments)

parse-cml - Experimental Results
CLIA Workshop @ DA-IICT
28 – 30 September 2012
Sandhan – The CLIA system
Details
Overview

Identify certain selected features to find the Focused Entity in a document

Once a Focused Entity is found, measure the support for the entity in the content

Use this measure to score the document
Document Fields:

url
Title
Description
Content
Domain
Named Entities
In-links
In-linked anchors text
Crawl Datum Score
Focused Entity Identification & Scoring
Places to See
Overview
City Info
Extracting segments with their topics
parse-cml: What’s Next?
Tagged Content
Cleaned Content
Site: http://www.exploredarjeeling.com/
Content with header and paragraph tags:
parse-cml: What’s Next?
Extracted Key Phrases:
Darjeeling Glimpse Summer Snow fall Baddogra Siliguri Adventure sports best time to visit Advance hotel booking ava art gallery Buddhism Bhutias beauty of darjeeling batasia loop burdwan palace botanical garden Cultures Climbing Cinchona Common leopard Toy Train
Cleaned Content
Site: http://www.exploredarjeeling.com/
Key phrase extraction from each document's content and anchor texts - An example
parse-cml: What’s Next?
[Illustration: a webpage segmented into blocks - advertisements, the actual content, forms, logo, vertical and horizontal menus, banner, and site details]
Last update sent on: 23 February 2012 23:59 IST
Removes noise from web pages & extracts the clean content
CMLifier (parse-cml)
[Diagram blocks: WWW, Web Page, Noise Filtering (Title, Meta Info, Out links, Content), Lang / Domain Identifier, Font Transcoder, Parse / Parse Data / Parse Text, Index]
Current Version: Existing
Duplicate sentence removal at i) phrase level ii) sentence level
HTML entities conversion (e.g. "&amp;" -> "&")
Meta tags filter improved
Markers added to: title, content, description and keywords
Updated:
Hex Code to Unicode conversion has been improved
topKwords are identified and extracted from the document content
Document boosting score is computed from the content richness
CMLifier – (parse-cml) plugin
[Diagram: Document -> Hash set -> Hash Query Generator -> Boolean Query -> Inverted Hash Index -> Candidate Document Set -> Top k Documents -> Near Duplicate Check -> Near Duplicate Documents -> Duplicate Deletion Policy -> Document to be Retained / Documents to be deleted]
Near Duplicate Detection (contd…)
The CMLifier output (the parsed content) is the input to the indexer
Indexer builds a document object with the following fields: url, title, content, meta keywords, domain, lang, site, NER, MWE, meta description, host, digest, etc
NEs and MWEs are identified during indexing, and the identified NEs and MWEs are added to the fields NER and MWE.
Each field is configured with one of two options:
Indexed (searchable) and Stored:
url, title, content, domain, lang, NER, MWE
The created fields are combined into a single Document object which is written to the Index
[Diagram: CMLifier (parse-cml) -> Parse / Parse Data / Parse Text -> Field 1, Field 2, …, Field n -> Document Object -> Inverted Index]
Indexer – (index-cml) plugin
Single Postings list for all synonyms …
নরিম্যান | নারিমান | নারিমাণ | নরিম্যন | নরীম্যান | নরিম্যাণ | নরিমাণ | নরিমান
নিউ | নয়া | নতুন
ভারতবর্ষ | ইন্ডিয়া | ভারত | ইন্দিয়া | ইনডিয়া
Synonym Engine
<synonyms>
<group>
<syn>ভারতবর্ষ</syn>
<syn>ইন্ডিয়া</syn>
<syn>ভারত</syn>
<syn>ইন্দিয়া</syn>
<syn>ইনডিয়া</syn>
</group>
<group>
<syn>নিউ</syn>
<syn>নয়া</syn>
<syn>নতুন</syn>
</group>
<group>
<syn>নরিম্যান</syn>
<syn>নারিমান</syn>
<syn>নারিমাণ</syn>
<syn>নরিম্যন</syn>
<syn>নরীম্যান</syn>
<syn>নরিম্যাণ</syn>
<syn>নরিমাণ</syn>
<syn>নরিমান</syn>
</group>
</synonyms>
The Synonym Engine creates a common entry in the Inverted Index for all variant spellings of the same word
During searching, variant spellings are mapped to the normalized entry in the inverted index
The list of spelling variations is maintained in XML format
Synonym Injector (contd…)
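
A minimal sketch of loading such a synonyms file with the standard DOM parser, assuming the file is saved as synonyms.xml and taking the first <syn> of each <group> as the normalized entry; the actual injector's choice of normalized form is not stated in the slides.

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch of loading the synonyms XML above into a variant -> normalized-form map.
// Choosing the first <syn> of each <group> as the normalized entry is an assumption.
public class SynonymInjectorSketch {
    static Map<String, String> load(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xmlFile);
        Map<String, String> normalize = new HashMap<>();
        NodeList groups = doc.getElementsByTagName("group");
        for (int g = 0; g < groups.getLength(); g++) {
            NodeList syns = ((Element) groups.item(g)).getElementsByTagName("syn");
            if (syns.getLength() == 0) continue;
            String canonical = syns.item(0).getTextContent().trim();
            for (int s = 0; s < syns.getLength(); s++) {
                normalize.put(syns.item(s).getTextContent().trim(), canonical);
            }
        }
        return normalize;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> normalize = load(new File("synonyms.xml"));   // assumed file name
        // At index and query time, each variant spelling is replaced by its normalized entry:
        System.out.println(normalize.getOrDefault("ইন্ডিয়া", "ইন্ডিয়া"));   // -> ভারতবর্ষ
    }
}
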
[Example segment topics: Travel from – to, Transportation, Historical places, Weather, Hotels]
Extracting segments with their topics
parse-cml: What’s Next?
[Flow chart blocks: START, Crawled Content, Obtain HTML, Html Cleaner Library, Tag Balanced HTML Page, Parse, Link-to-Text ratio, Filter Noisy Blocks, Extract Clean CONTENT, Generate FIELDS (title, url, description, keywords, topKwords, content, lang, NER, MWE, boost, digest, domain, host, site, segment), Parse Object / Parse (Data / Text), Indexer, STOP]
parse-cml : Flow Chart
[Diagram: one Focused Entity among several Entities in a document]
Entity / Focused Entity
parse-cml (contd…)
[Illustration: Web Page -> Segmenting into blocks (B1 … B10) -> Structural Mining -> Extracted Block Text -> Link-to-Text Ratio -> Qualified Blocks (B5, B6, B7) -> Extract the Text -> Extracted CLEANED Content]

Qualified block texts (example):
B5: Agra, the beautiful city is situated on the banks of holy river Yamuna. Agra is the third largest city in the state of Utter Pradesh in India and a prominent tourist destination …
B6: Apart from Taj Mahal, Agra has two other world heritage sites one of them is Red Fort and other is Fatehpur Sikri. The Red Fort Agra is also located on the other side of the river Yamuna …
B7: If you are visiting Agra India for the first time then tourist offices and authorized travel agencies are good place to gather some basic information about the city …

Initially, Documents are given null score
Then the score of each qualified segment is constructed incrementally
Dr Rajendra Prasath