Sandhan - CLIA system - IIT Kharagpur

Presented by: Prof. Sudeshna Sarkar and R. Rajendra Prasath

on 28 September 2012

Transcript of Sandhan - CLIA system - IIT Kharagpur

Sandhan – The CLIA system
Prof. Sudeshna Sarkar
Welcome to the CLIA workshop @ DA-IICT, Gandhinagar
Indian Institute of Technology Kharagpur
28 September 2012 – 30 September 2012

CLIA Consortium: AU-KBC | AU-CEG, C-DAC Pune, DA-IICT, Gauhati, IIIT Bhubaneswar, IIIT Hyderabad, IIT Bombay, IIT Kharagpur, Jadavpur, ISI Kolkata
Funded by: Ministry of Communications and Information Technology (MCIT), Government of India

CORE components developed by IIT Kharagpur:
  • parse-cml
  • index-cml
  • ranker
  • analysis-bn: Bengali Stemmer
  • seed URLs collector
  • synonym injector
  • near duplicate detector
  • language resources

CMLifier (parse-cml): cleaned web content (parse object / parse data / parse text)
Indexer (index-cml): cleaned web content → inverted index

Ranker – our experiments:
  • OPIC score with different boost factors
  • Web graph generation & analysis
  • Ranking with results diversification

CMLifier – The Overall Architecture:
  • Link-to-text ratio heuristic to eliminate noisy blocks in the web document
  • Different HTML tag filtering approaches implemented

The CMLifier output (the parsed content) is the input to the indexer.

parse-cml – Current version, approaches used:
  • Duplicate sentence removal at phrase level and sentence level
  • HTML entity conversion: "&amp;" → "&"
  • Meta tags filter improved
  • Hex code to Unicode conversion improved
  • Markers added to: title, content, description and keywords

parse-cml – Next update:
  • Content with header and paragraph tags
  • Key phrase extraction from each document
  • Extracting informative segments with their topicality
  • Incorporating near duplicate content detection

parse-cml / Indexer: content extraction accuracy with effective noise filtering; average content extraction speed: 0.012 sec per page [with NO font transcoder / language / domain identification]
Duplicate / near duplicate detection

Bengali Stemmer (plugin: analysis-bn)
  Current: list lookup over noun root, noun suffix, verb root and verb suffix lists; stemming accuracy: 80.58%
  Next update: TRIE-based Bengali Stemmer

Ranker
  Current: customized way of setting the Similarity Function
  Next attempt: Segmented Text Importance Computation (STIC) scoring

Seed URLs collection
  Current: automatic collection of seed URLs; language: Bengali; Entities and Patterns used to generate user information needs
  Statistics: 33,769 Bengali seed URLs collected; domain: tourism

The Synonym Engine creates a common entry in the inverted index for all variant spellings of the same word; the list of spelling variations is kept in XML format. During searching, variant spellings are mapped to the normalized entry in the inverted index.

Language resources of IIT KGP – Generated: 30 Bengali queries [81–110] for CLIA testing
Seed URLs collection: 33,769 Bengali, 1445 English
Synonym list: 4000 NEs with an average of 4 spelling variations per NE
Generated Test Data to build Bengali Classifier whose accuracy is 85.42%
No of Bengali documents collected: 900
Domains: Tourism, General
Named Entity annotation completed on 100,000 word Bengali corpus
Named Entity Transliteration of almost 8000 Bengali words
1300 Multiword Expressions relevant to Tourism
700 query templates created to understand users' intentions
Rewrote the Bengali Analyzer to suit Lucene 3.x
Identification of topics and subtopics is in progress

parse-cml – Flow Chart; Content Extraction (parse-cml): web page → extracted cleaned content

index-cml: The CMLifier output (the parsed content) is the input to the indexer
Indexer builds a document object with the following fields:
url, title, content, meta keywords, domain, lang, site, NER, MWE, meta description, host, digest, etc
NEs and MWEs are identified during indexing and added to the fields NER and MWE.
Each field configured with one of two options:
index(searchable): url, title, content, domain, lang, NER, MWE, meta description
stored(store) : url, title, domain, lang, meta keywords, meta desc
The created fields are combined into single Document object which is written to the Index index-cml Current: Fields and their options like store / index are pre-determined
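The indexed-vs-stored distinction described above can be sketched schematically. This is illustrative Python, not the actual Lucene/Nutch API; the field subset and page dictionary are simplified from the slide:

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    value: str
    indexed: bool = False   # searchable: term goes into the inverted index
    stored: bool = False    # retrievable: raw value is kept alongside the index

def build_document(page):
    """Combine per-page values into one document object, mirroring index-cml."""
    return [
        Field("url",     page["url"],     indexed=True, stored=True),
        Field("title",   page["title"],   indexed=True, stored=True),
        Field("content", page["content"], indexed=True),  # indexed, not stored
        Field("domain",  page["domain"],  indexed=True, stored=True),
        Field("lang",    page["lang"],    indexed=True, stored=True),
    ]

doc = build_document({"url": "http://example.org", "title": "Agra",
                      "content": "Agra is on the Yamuna.",
                      "domain": "tourism", "lang": "en"})
searchable = [f.name for f in doc if f.indexed]
```

As in the slide, `content` is searchable but not stored, while `url`, `title`, `domain` and `lang` are both indexed and stored.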

Next update: index-cml rewritten to handle configurable fields, so as to support higher versions of Lucene

Future plans (Ranker plugin: scoring-stic):
New Query Expansion for Bengali Retrieval
Cross Lingual Information Retrieval
Bengali - Hindi
Bengali - English
Incorporating Pseudo Relevance Feedback (PRF)
TRIE based Bengali Stemmer
Incorporating Topics/Subtopics into document scoring
Creation of Language Vertical Resources

Ranking experiments with Sandhan, by IIT KGP:
System 1: The Base (current) Sandhan system
System 2: Web-graph with link rank (or PageRank) algorithm for Bengali documents
  – Overall, applying link rank increases retrieval effectiveness
System 3: Sandhan system with near duplicate detection algorithm
  – Diverse results are presented in the top 10
System 4: combines the benefits of both System 2 and System 3
Additionally, we altered boosting in the core indexer module, indexed in-linked anchor texts for each document, added content-richness-based document boosting, and added a URL decoder to collect and index words from ill-formed URLs.

We use the following features to award each document a score:
a. OPIC score (the scoring strategy in the current Sandhan system)
b. LINK score (obtained for each document by applying link ranking)
c. CONTENT score (based on the document's content richness)

Crawl Name : bn-NEs2Urls-All (frozen)
Crawl Size : 1,18,159 Total Docs [77,986 Bengali Docs]
Fields Boost : NEs(m), NEs in Content(m),
m = 4.0, 8.0, 11.0, 14.0, 20.0 (weights) & tourism(10.0-Fixed)
Queries Range : 81 - 110 (in Bengali)

m = 5.0 m = 8.0 m = 11.0 m = 14.0 m = 20.0

P@5 0.66 0.52 0.56 0.55 0.52
P@10 0.53 0.45 0.46 0.46 0.45 Expr – 1: Sandhan with different Boost factors to NEs Crawl Name : bncrawl-26122011 (frozen)
Crawl Size : 1,75,778 Total Docs [1,32,204 Bengali Docs]
Fields Boost : NEs (12.0), MWEs(12.0), NEs and MWEs - each in Content(12.0) (weights) & Domain = tourism (10.0)
m = 4.0, 8.0, 11.0, 14.0, 20.0 (weights) & tourism(10.0-Fixed)
Queries Range : 81 - 110 (in Bengali)

p@5 p@10 p@15 p@20 p@25 p@30

avg 0.55 0.46 0.37 0.32 0.28 0.25 Expr – 2: Sandhan with Local Crawl Crawl Name : bncrawl-26122011 (frozen)
Crawl Size : 1,75,778 Total Docs [1,32,204 Bengali Docs]
Fields Boost : NEs (12.0), MWEs(12.0), NEs and MWEs-each in Content(12.0), (weights) NEs-in Title(12.0), NEs-in Url(14.0) and Domain = tourism (10.0)
Queries Range : 81 - 110 (in Bengali)

p@5 p@10 p@15 p@20 p@25 p@30

avg 0.55 0.45 0.38 0.32 0.29 0.26 Expr – 3: Sandhan with Local Crawl Performance Comparison of 4 rankings Bengali Monolingual Retrieval - Sandhan Performance – Comparison* with Google Sandhan1: as tested on 17 February 2012
Sandhan2: as tested on 21 February 2012, depth 3, IITB system

CML – An Example

Participating Institutions: C-DAC NOIDA
Visit at: http://www.tdil-dc.in/sandhan
Funded By:

Crawl Name : bncrawl-26122011 (frozen)
Crawl Size : 1,75,778 Total Docs [1,32,204 Bengali Docs]

Fields Boost : NEs (m), MWEs(m), NEs and MWEs - each in Content(m) (weights) & Domain = tourism (10.0),
where m = 4.0, 8.0, 11.0, 14.0, 20.0 (weights)
and
NEs-in Title (12.0), NEs-in Url (14.0)

Queries Range : 81 - 110 (in Bengali)

p@5 p@10 p@15 p@20 p@25 p@30
avg 0.55 0.45 0.38 0.32 0.29 0.26 9/24/2012 Expr – 3: Sandhan with Local Crawl Crawl Name : bn-NEs2Urls-All (frozen)
Crawl Size : 1,18,159 Total Docs [77,986 Bengali Docs]

Fields Boost : NEs(m), NEs in Content(m),
m = 1.0, 4.0, 8.0, 11.0, 14.0, 20.0 (weights)
& domain = tourism(10.0-Fixed)

Queries Range : 81 - 110 (in Bengali)

m = 1.0 m = 5.0 m = 8.0 m = 11.0 m = 14.0 m = 20.0
P@5 0.42 0.66 0.52 0.56 0.55 0.52
P@10 0.36 0.53 0.45 0.46 0.46 0.45 9/24/2012 Expr – 1: Sandhan with different Boost factors to NEs (With Local Crawl) 9/24/2012 Step – 1: Considered Features to Find a Focused Entity
Presence of an entity in the fields: title, meta description, url, in-linked anchors
Number of sentences in which the entity occurs
tf of the entity in the document
idf of the entity
Step – 2: Measure the support for the Focused Entity
F_SCORE → fraction of sentences in which the entity occurs
PR_SCORE → occurrence of the entity in the top-ranked sentences
POS_SCORE → occurrence of the entity towards the sentence beginning
contentScore = F_SCORE + PR_SCORE + POS_SCORE

Focused Entity Identification & Scoring

How to define Content Richness?
A document d may have specific information about certain entities (currently place name)
These entities may be present across different segments of the document content.
Segments may be paragraphs, sentences, etc
These entities may be presented as probable query terms by the users
The importance of an entity with respect to the document d may be an informative feature for document scoring
So our aim is to define a scoring strategy that estimates the amount of supporting information at segment level for such entities in that document content.
Finally, use this score as a measure of the richness of the document content.

Content Richness Computation

System 2: Web-graph with link ranking

The base system performs document similarity scoring with the Lucene practical scoring formula:

score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) )
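Numerically, the formula can be sketched as follows. This is a toy model that mimics the shape of the practical scoring formula, not Lucene's actual implementation; tf here uses Lucene's classic sqrt(frequency), and the length norm is a simple 1/sqrt of document length:

```python
import math

def lucene_score(query_terms, doc, idf, boosts, coord, query_norm):
    """Toy version of: score = coord * queryNorm * sum_t tf * idf^2 * boost * norm."""
    doc_len = sum(doc.values())                 # total term occurrences in doc
    total = 0.0
    for t in query_terms:
        tf = math.sqrt(doc.get(t, 0))           # classic Lucene tf = sqrt(freq)
        norm = 1.0 / math.sqrt(doc_len)         # simplified length norm
        total += tf * idf[t] ** 2 * boosts.get(t, 1.0) * norm
    return coord * query_norm * total

# doc maps term -> frequency; idf and boosts are illustrative values
score = lucene_score(["agra"], {"agra": 4, "fort": 2},
                     idf={"agra": 1.5}, boosts={"agra": 2.0},
                     coord=1.0, query_norm=0.5)
```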

Where
tf = term frequency - measure of how often a term appears in a document
idf = inverse document frequency – a measure that rewards terms appearing in fewer documents across the index
coord = number of terms in the query that were found in the document
lengthNorm = measure of the importance of a term according to the total number of terms in the field
queryNorm = normalization factor so that queries can be compared
boost (index) = boost of the field at index-time
boost (query) = boost of the field at query-time

System 1: The Base (current) Sandhan system

Current:
List lookup
Resources used:
noun root, noun suffix, verb root, verb suffix

Accuracy : 80.58%
Speed : 11 seconds to stem 14,807 unique terms
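A list-lookup stemmer of the kind described can be sketched as follows. The toy data is Latin-script for illustration only; the real resources are Bengali noun/verb root and suffix lists:

```python
def make_stemmer(roots, suffixes):
    """List-lookup stemming: strip the longest suffix that leaves a known root."""
    ordered = sorted(suffixes, key=len, reverse=True)   # longest match first
    def stem(word):
        if word in roots:                 # already a listed root
            return word
        for suf in ordered:
            if word.endswith(suf):
                base = word[: len(word) - len(suf)]
                if base in roots:         # only strip if the remainder is a root
                    return base
        return word                       # unknown word: leave unchanged
    return stem

# hypothetical transliterated roots and case suffixes
stem = make_stemmer(roots={"jal", "ghar"}, suffixes={"er", "ke", "era"})
```

The same lookup structure explains the TRIE-based next step: replacing the linear suffix scan with a trie makes the longest-suffix match sublinear.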

Latest update sent on: 23 May 2012 14:22 IST

analysis-bn: Bengali Stemmer

Parsing – core content filtering module: CMLifier (plugin: parse-cml)
  • cleans web pages by identifying and removing noisy segments
  • extracts the cleaned text content of the web documents

Ranker module in pluggable format
Cross Lingual Information Retrieval
Bengali – Hindi
Bengali - English
TRIE based Bengali Stemmer
Creation of language specific resources in Bengali

Near duplicate documents detection with map-reduce
Incorporating Pseudo Relevance Feedback (PRF)
Incorporating topics / subtopics into document scoring
New query expansion: incorporating the Clustering By Direction (CBD) algorithm to expand Bengali queries

Future Attempts

Crawl Name : bncrawl-26122011 (frozen)
Crawl Size : 1,75,778 Total Docs [1,32,204 Bengali Docs]

Fields Boost : NEs (m), MWEs(m), NEs and MWEs - each in Content(m) (weights) & Domain = tourism (10.0)
where m =1.0, 4.0, 8.0, 11.0, 14.0, 20.0 (weights)

Queries Range: 81 - 110 (in Bengali)

p@5 p@10 p@15 p@20 p@25 p@30
avg 0.55 0.46 0.37 0.32 0.28 0.25 9/24/2012 Expr – 2: Sandhan with Local Crawl Steps for Improvement 9/24/2012 We collected 3000 Named Entities of popular places in India
We created a list of 1,300 multiword expressions relevant to tourism.
A spelling-variation list was created with 1,000 named entities in Bengali, averaging 4 spelling variations per word.
For each named entity in the NE list, we used other search engines (Google, Yahoo!, Rediff) and collected relevant URLs (a total of 33,769).
Bengali Stemmer accuracy improved with the inclusion of new words in the root list and additional rules.

Bengali Monolingual Search

System 1: Basic Sandhan
System 2: Basic Sandhan with Link Rank
System 3: Basic Sandhan with Link Rank & near dedup
System 4: Basic Sandhan with Link Rank, near dedup and content, url and in-link anchor text scoring Summary: p@d (d=5,10) of Systems: 1 – 4 9/24/2012 System 1: Basic Sandhan System 2: Basic Sandhan with Link Rank
System 3: Basic Sandhan with Link Rank & near dedup
System 4: Basic Sandhan with Link Rank, near dedup and content scoring Comparison: p@5 of Systems: 1 – 4 9/24/2012 Ranking Experiments: The Overall Comparison 9/24/2012 We have combined the benefits of link ranking and near duplicate detection with boost factors of content, url and in-linked anchor texts. For this, we made the following changes:

Altered certain boosting in the core indexer module
Indexed in-linked anchor texts for a document
Content Richness based document boosting
Added a URL decoder to index terms from ill-formed URLs

Additionally, we used the following features to award each document a score:
OPIC Score (current scoring strategy of Sandhan)
LINK Score (Score obtained by applying link ranking)
CONTENT Score (Score based on content richness) System 4: Sandhan with link ranking, near dedup algorithm & content, in-link scoring 9/24/2012 Document Reputation:
The goodness of the document can be estimated as follows:

Inlink Score = sqrt(number of extra-domain links) + log(number of intra-domain links) + number of anchor texts containing the focused entity
URL Score = (OPIC Score + Link Score) × subjective site reputation
Overall Document Score = Inlink Score + URL Score (given as the document boost)

Here subjective site reputation is based on certain preferences like boosting govt. websites higher Content Richness Computation (contd…) 9/24/2012 Current Sandhan system finds “EXACT Duplicates”, but does not provide facilities to detect near duplicate documents
We developed a near duplicate detection algorithm and eliminated near duplicated documents from the index
Currently there is no evaluation strategy for duplicated results: for a query, if a top-ranked document is duplicated n times, all n copies may be ranked or clustered together with the original, inflating p@k.
It is therefore essential to identify and remove near-duplicate documents.
Removal of near duplicates may drop precision below that of System 2; however, it increases the diversity of results among the top 10 documents.

System 3: Sandhan with near duplicate detection algorithm

System 1:
The Base (current) Sandhan system

System 2:
Web-graph with link rank algorithm

System 3:
Sandhan system with near duplicate detection algo

System 4:
Sandhan with link ranking, near dedup algorithm & content, in-link anchor text scoring Ranking with Sandhan – Our Attempts 9/24/2012 Current:

OPIC score with different boost factors

Web Graph generation & analysis (nutch-0.9 needs upgrade to ver 1.1 or above)

Ranking with results diversification Ranker Experiments by IITKGP 9/24/2012 Points:

Presently, near duplicate detection must be performed as a separate job, like the parser and indexer

It is implemented in a subsystem which sequentially performs near duplicate document detection and elimination

To make this option scalable, it is necessary to rewrite this subsystem in the MapReduce framework.

Near Duplicate Detection – Overheads

Algorithm: Step II – Near Duplicate Detection and Deletion

For each document di in the Master Index do:
    If marked_for_delete(di) = true then SKIP to next document
    Else current_doc = di
        hashset_currentDoc = getHashset(current_doc)
        B = Construct_Boolean_Query(hashset_currentDoc)
        Candidate_documentSet = queryInvertedHashIndex(B)
        For the TOP k docs dki in Candidate_documentSet do:
            candidate_doc = dki
            hashset_candidateDoc = getHashset(candidate_doc)
            overlap_coefficient = CalculateOverlapCoefficient(hashset_currentDoc, hashset_candidateDoc)
            If overlap_coefficient > threshold then
                add candidate_doc to marked_nearDuplicateSet
            End If
        End For
        deleteBasedOnPolicy(marked_nearDuplicateSet UNION current_doc)
End For
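A self-contained sketch of this detection step: hash sets are built from token n-grams and compared with the overlap coefficient. The inverted hash index and the deletion policy of the full algorithm are omitted, and the threshold value is illustrative:

```python
def ngram_hashes(tokens, n=3):
    """Hash every n-gram of a token sequence (Step I in miniature)."""
    return {hash(tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)}

def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|): 1.0 means one set is contained in the other."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def near_duplicates(docs, threshold=0.8, n=3):
    """Mark documents whose n-gram hash set overlaps an earlier document's."""
    kept, marked = [], []
    for doc_id, tokens in docs:
        h = ngram_hashes(tokens, n)
        if any(overlap_coefficient(h, kept_h) > threshold for _, kept_h in kept):
            marked.append(doc_id)        # near duplicate of a kept document
        else:
            kept.append((doc_id, h))     # new content: keep it
    return marked
```

The overlap coefficient deliberately divides by the smaller set, so a short page fully contained in a longer one still scores 1.0, which is the "identical except for a few characters" case described below.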
Hash Generation
 
For each document di in D that has been crawled and parsed:
    tokenSet = getTopNTokens(di, N)
    ngramSet = createNGram(tokenSet)
    hashSet = GenerateHashes(ngramSet)
    buildInvertedHashIndex(hashSet, di)
End For

Algorithm: Step I

Near Duplicate Detection – Illustration

We use an XML file to maintain the list of spelling variations

The Synonym Engine parses the XML file containing the spelling variations and creates a common entry in the Inverted Index for all the variant spellings of the same word

During searching, variant spellings are mapped to the normalized entry in the index
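The mapping the Synonym Engine performs can be sketched as follows, assuming an XML layout like the `<synonyms>/<group>/<syn>` file shown later in this deck; normalizing every variant to its group's first entry is an illustrative choice, not necessarily the engine's actual rule:

```python
import xml.etree.ElementTree as ET

def load_synonym_map(xml_text):
    """Map every spelling variant in a <group> to that group's first <syn> entry."""
    variant_to_canonical = {}
    for group in ET.fromstring(xml_text).findall("group"):
        spellings = [syn.text for syn in group.findall("syn")]
        for spelling in spellings:
            variant_to_canonical[spelling] = spellings[0]
    return variant_to_canonical

SYNONYMS_XML = """<synonyms>
  <group><syn>ভারতবর্ষ</syn><syn>ইন্ডিয়া</syn><syn>ভারত</syn></group>
  <group><syn>নিউ</syn><syn>নয়া</syn><syn>নতুন</syn></group>
</synonyms>"""

normalize = load_synonym_map(SYNONYMS_XML)
```

At index and query time, each token is looked up in `normalize` so all variants share one postings list.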

Current status:
Synonym list: generated 4,000+ named entities with an average of 4 spelling variations per named entity in Bengali

Synonym Injector

Experiment – 1: Accuracy: 80.58%
# Terms used from Bengali Web docs: 2,070 unique terms
Observation: NEs are getting stemmed correctly in most cases and only suffix stripping takes place in some cases.

Experiment - 2 : Accuracy: 75.21%
# Terms used from the Bengali FIRE corpus: 8,146 unique terms

Bengali Stemmer Performance

index-cml uses two subsystems during indexing

Based on the language of the content, the corresponding language analyzer is invoked

Analysis-XX [ XX = “bn” for Bengali language]
Synonym Injector [presently provided only for Bengali]

Indexing Phase

The URL decoder will be updated in parse-cml:

Example: Currently Sandhan does not handle the decoding of UTF-8 percent-encoded URLs. The actual URL of the article on "হাওড়া-ব্রিজ-রবীন্দ্র-সেতু" (Howrah Bridge / Rabindra Setu) looks as follows:

http://wikimapia.org/7033110/bn/%E0%A6%B9%E0%A6%BE%E0%A6%93%E0%A7%9C%E0%A6%BE-%E0%A6%AC%E0%A7%8D%E0%A6%B0%E0%A6%BF%E0%A6%9C-%E0%A6%B0%E0%A6%AC%E0%A7%80%E0%A6%A8%E0%A7%8D%E0%A6%A6%E0%A7%8D%E0%A6%B0-%E0%A6%B8%E0%A7%87%E0%A6%A4%E0%A7%81

After URL decoding, the following output should be added to the index:
http://wikimapia.org/7033110/bn/হাওড়া-ব্রিজ-রবীন্দ্র-সেতু
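Percent-decoding of UTF-8 URLs like this is standard-library territory; a minimal sketch (the URL is shortened to its first two path tokens for illustration):

```python
from urllib.parse import unquote

encoded = ("http://wikimapia.org/7033110/bn/"
           "%E0%A6%B9%E0%A6%BE%E0%A6%93%E0%A7%9C%E0%A6%BE-"
           "%E0%A6%AC%E0%A7%8D%E0%A6%B0%E0%A6%BF%E0%A6%9C")

decoded = unquote(encoded)                       # %-bytes decoded as UTF-8
tokens = decoded.rsplit("/", 1)[-1].split("-")   # searchable Bengali tokens
```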

Note: The decoded URL generates the tokens হাওড়া, ব্রিজ, রবীন্দ্র, সেতু, which are searchable, whereas the original URL does not generate these tokens.

parse-cml: What's Next?

Current:
We have collected tourism specific seed URLs in Bengali using popular search engines, blogs, forums and travel sites
Entities and their associated query patterns related to tourism domain are used to generate user information needs:
For example (in English):
For the entity – “Darjeeling”, we have generated query patterns like, “how to reach Darjeeling”, “cheap accommodation in Darjeeling”, “places to see in Darjeeling”, etc
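Template expansion of this kind is straightforward to sketch (hypothetical English templates as in the slide's own example; the real system uses Bengali patterns):

```python
TEMPLATES = [
    "how to reach {entity}",
    "cheap accommodation in {entity}",
    "places to see in {entity}",
]

def generate_queries(entities, templates=TEMPLATES):
    """Cross each entity with each query template to simulate user information needs."""
    return [t.format(entity=e) for e in entities for t in templates]

queries = generate_queries(["Darjeeling"])
```

Each generated query can then be submitted to external search engines to harvest candidate seed URLs, as described above.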

Statistics:
33,769 Bengali seeds collected
Type: Tourism related seed URLs Seed URLs Collection 9/24/2012 Resources Developed in Bengali 9/24/2012 Language Horizontal Tasks - Core Modules:
Parsing (parse-cml plugin)
Indexing (index-cml plugin)
Ranking
Language Vertical Tasks - Additional subsystems:
Language Resources Developed in Bengali
Language Analyzer: analysis-bn (Bengali Analyzer)
Bengali Stemmer
Synonym Injector
Near Duplicate Detection and Elimination

CORE Components – By IIT Kharagpur

S.No  Crawl                          # Documents    # Bengali Docs
------------------------------------------------------------------
1.    bn-NEs2Urls-All (frozen)          1,18,159         77,986
2.    bnCrawl-nes2urls-extn-D2          4,20,714       3,22,390
3.    bncrawl-26122011 (Frozen)*        1,75,778       1,32,204
4.    bnCrawl-FrozenSeeds-D3           18,19,097       8,12,013
      [Latest crawl – additionally crawled to depth 3]

Details of RJs done with bncrawl-26122011 (Frozen)* on Bengali queries 81–110

* Sanity check

Summary

Experiment with different boost factors for NEs
Queries : 81–110 Bengali queries
Crawl : 77,986 Bengali documents
Fields boosted : URL(3), title(3), content(10), domain(10), anchor(1); different boost values for NEs

Experiments on ranking (with local crawl) by varying different boost values

Sandhan1: as tested on 17 February 2012
Sandhan2: as tested on 21 February 2012, depth-3 crawl, IITB system

Sandhan Performance – Comparison* with Google

Final Results – Bengali Monolingual Retrieval

System 1: Basic Sandhan
System 2: Basic Sandhan with Link Rank
System 3: Basic Sandhan with Link Rank & near dedup
System 4: Basic Sandhan with Link Rank, near dedup and content scoring Comparison: p@10 of Systems: 1 – 4 9/24/2012 System 4: Sandhan with link ranking, near dedup algorithm & content, in-link scoring Evaluator Feedback:
9 documents, including news and blogs, are found in the top 10 search results, with good pictures and tourism-related information such as pilgrimages, tour plans, transport and attractions in Leh Ladakh. Query: লেহ লাদাখ

For more details: http://pr.efactory.de/

What is link ranking?
  • Similar to Google's PageRank algorithm
  • A measure of the relative importance of pages across the world wide web

Total Execution Time:

Nearly 6 hours

Statistics – Corpus: bnCrawl-D2-deDup
  • Size of the corpus (# documents) : 4,80,681
  • Number of duplicate documents detected : 2,92,548
  • Number of unique documents (no duplicates) : 1,44,393
  • Number of documents having at least one duplicate : 43,740
  • Number of documents having ≥ 500 duplicates : 19

Near Duplicate Detection – Experimental Results

Nutch handles EXACT duplicates, but NOT near duplicates
NEAR DUPLICATES: the task of identifying and organizing documents that are "nearly identical" to each other, i.e., the content of one web page is identical to that of another except for a few characters
Our Near Duplicate Detection module identifies and removes near duplicate contents from the index Near Duplicate Detection 9/24/2012 Key phrases extraction from Single Document* (contd…)

5. Calculate the χ′² value:
For each term w, freq(w, g) = co-occurrence frequency of w with g ∈ C
nw = total number of terms in the sentences including w
Calculate the χ′² value (per the cited paper, χ′²(w) is χ²(w) minus its maximal term) using:

    χ²(w) = Σ_{g ∈ C} ( freq(w, g) − nw · pg )² / ( nw · pg )

where

nw is the total number of terms in sentences where w appears, and
pg is (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document)
6. Output: keywords - top m (= 20) terms having the largest χ' 2 value parse-cml: What’s Next? Y. Matsuo, M. Ishizuka, Keyword extraction from a single document using word co-occurrence statistical information, International Journal on Artificial Intelligence Tools, 13(1): 157-169 (2004) 9/24/2012 Key phrases extraction from Single Document* - Algorithm

1. Preprocess:
Stem words by Porter algorithm (Porter 1980)
extract phrases based on n-grams [here we assumed max(n)= 3]
remove stop words using SMART stop words list (Salton 1988)
2. Select frequent terms:
Ntotal = up to the top 30% of frequent terms
3. Cluster frequent terms:
Cluster a pair of terms using Jensen-Shannon divergence (> threshold = 0.95 × log 2)
Cluster a pair of terms using mutual information (> threshold = log(2.0) )
C <- #clusters obtained
4. Calculate the expected probability:
ng = # terms co-occurring with g ∈ C
The expected probability: pg = ng / Ntotal

parse-cml: What's Next?

Content Extraction Speed: 0.012 sec per page
[with NO Font transcoder / Language / Domain Identification] parse-cml - Experimental Results 9/24/2012 HTML to CML conversion: An Example 9/24/2012 Evaluator Feedback:
Most of the documents, including news and blogs, in the top 10 search results are diverse. These documents contain useful information on tourism, tours, pilgrimages, transport & attractions in Leh Ladakh. Query: লেহ লাদাখ

System 3: Sandhan with near duplicate detection algorithm – Experimental Results

Evaluator Feedback:
Only one document, a Wikipedia page, is found; it describes the district of Ladakh, mentioning Leh as its largest city, and includes a map and an image.
The rest of the documents are all duplicates. Query: লেহ লাদাখ

System 2: Web-graph with link ranking – Experimental Results

Evaluator Feedback:
4 documents out of the last 5 results in the top 10 contain illustrative pictures and information on tourism, with special focus on tour plans, pilgrimages, transport and attractions in Leh Ladakh. The rest of the retrieved pages contain hyperlinks leading to Leh and Ladakh. Query: লেহ লাদাখ

System 1: The Base (current) Sandhan system – Experimental Results

Actual Query: লেহ লাদাখ

Expanded Lucene Query:
+(NER:লেহ^12.0 content:লেহ^12.0 title:লেহ^12.0 host:লেহ^12.0 url:লেহ^12.0 anchor:লেহ^12.0 NER:লাদাখ^12.0 content:লাদাখ^12.0 title:লাদাখ^12.0 host:লাদাখ^12.0 url:লাদাখ^12.0 anchor:লাদাখ^12.0) +lang:bn domain:tourism
url:"লেহ লাদাখ"~2147483647^3.0 anchor:"লেহ লাদাখ"~4
content:"লেহ লাদাখ"~2147483647^10.0
title:"লেহ লাদাখ"~2147483647^3.0 host:"লেহ লাদাখ"~2147483647^2.0


Total Hits:
3,30,108 documents
Corpus:
bnCrawl-FrozenSeeds-D3

Document Scoring in Nutch – An Example
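The expanded query above interleaves per-field boosted clauses. How such a clause list is assembled can be sketched in string form (a hypothetical helper for illustration, not Nutch's actual query builder):

```python
def boosted_clauses(terms, field_boosts):
    """Produce Lucene-style 'field:term^boost' clauses for every term and field."""
    return [f"{field}:{term}^{boost}"
            for term in terms
            for field, boost in field_boosts.items()]

clauses = boosted_clauses(["লেহ", "লাদাখ"],
                          {"NER": 12.0, "content": 12.0, "title": 12.0,
                           "host": 12.0, "url": 12.0, "anchor": 12.0})
query = "+(" + " ".join(clauses) + ") +lang:bn domain:tourism"
```

Every query term is searched in every boosted field, which is exactly the shape of the expanded Lucene query shown above.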
Plans to add in the next update:

Content with header and paragraph tags

URL decoder will be updated in parse-cml

Key phrases extraction from each document

Extracting informative segments with their topicality

parse-cml: What's Next?

Issues: errors identified in eliminating segments containing form feeds, news feeds, copyright info, short messages, and menu items, for which the link-to-text ratio fails

Content Extraction Accuracy = (number of segments correctly extracted) / (total number of segments)

parse-cml – Experimental Results

CLIA Workshop @ DA-IICT
28 – 30 September 2012 Sandhan – The CLIA system 9/24/2012 Details Overview

Identify certain selected features to find the Focused Entity in a document

Once a Focused Entity is found then measure the support for the entity in the content

Use this measure to score the document Document Fields:

url
Title
Description
Content
Domain
Named Entities
In-links
In-linked anchors text
Crawl Datum Score Focused Entity Identification & Scoring 9/24/2012 Places to See Overview City Info Extracting segments with their topics parse-cml: What’s Next? 9/24/2012 Tagged Content Cleaned Content Site: http://www.exploredarjeeling.com/ Content with header and paragraph tags: parse-cml: What’s Next? Extracted
Key Phrases Darjeeling Glimpse Summer Snow fall Baddogra Siliguri Adventure sports best time to visit Advance hotel booking ava art gallery Buddhism Bhutias beauty of darjeeling batasia loop burdwan palace botanical garden Cultures Climbing Cinchona Common leopard Toy Train 9/24/2012 Cleaned Content Site: http://www.exploredarjeeling.com/ Key phrases extraction from each document content and anchor texts
- An example parse-cml: What’s Next? 9/24/2012 A webpage Advt Actual Content Advt Advt Forms Logo vertical Menu Banner Horizontal Menu Site Details Last update sent on: 23 February 2012 23:59 IST 9/24/2012 Removes noise from web pages & extracts the clean content CMLifier (parse-cml) Index Noise Filtering:
Title, Meta Info,
Out links, Content Lang / Domain Identifier Font Transcoder WWW Web Page Parse/
Parse Data/
Parse Text Current Version: Existing
Duplicate sentence removal at (i) phrase level and (ii) sentence level
HTML entity conversion: "&amp;" → "&"
Meta tags filter improved
Markers added to: title, content, description and keywords

Updated:
Hex code to Unicode conversion has been improved
topKwords are identified and extracted from the document content
The document boosting score is computed from the content richness

CMLifier – (parse-cml) plugin

[Flow: document → hash set → hash query generator → Boolean query → inverted hash index → candidate document set → top-k documents → near-duplicate check → near-duplicate documents → duplicate deletion policy → documents to be deleted / document to be retained]

Near Duplicate Detection (contd…)

The CMLifier output (the parsed content) is the input to the indexer
Indexer builds a document object with the following fields: url, title, content, meta keywords, domain, lang, site, NER, MWE, meta description, host, digest, etc
NEs and MWEs are identified during indexing and added to the fields NER and MWE.
Each field configured with one of two options:
Indexed (searchable) and Stored:
url, title, content, domain, lang, NER, MWE
The created fields are combined into a single Document object which is written to the index.

[Diagram: CMLifier (parse-cml) → parse / parse data / parse text → field 1 … field n → Document object → inverted index]

Indexer – (index-cml) plugin

Synonym Engine: a single postings list is kept for all synonyms of a word, e.g.
  নরিম্যান | নারিমান | নারিমাণ | নরিম্যন | নরীম্যান | নরিম্যাণ | নরিমাণ | নরিমান
  নিউ | নয়া | নতুন
  ভারতবর্ষ | ইন্ডিয়া | ভারত | ইন্দিয়া | ইনডিয়া

<synonyms>
<group>
<syn>ভারতবর্ষ</syn>
<syn>ইন্ডিয়া</syn>
<syn>ভারত</syn>
<syn>ইন্দিয়া</syn>
<syn>ইনডিয়া</syn>
</group>
<group>
<syn>নিউ</syn>
<syn>নয়া</syn>
<syn>নতুন</syn>
</group>
<group>
<syn>নরিম্যান</syn>
<syn>নারিমান</syn>
<syn>নারিমাণ</syn>
<syn>নরিম্যন</syn>
<syn>নরীম্যান</syn>
<syn>নরিম্যাণ</syn>
<syn>নরিমাণ</syn>
<syn>নরিমান</syn>
</group>
</synonyms> The Synonym Engine creates a common entry in the Inverted Index for all variant spellings of the same word During searching,
variant spellings are mapped to the normalized entry in the inverted index.

The list of spelling variations in XML format – Synonym Injector (contd…)

[Slide: extracting segments with their topics (Travel From–To, Transportation, Historical Places, Weather, Hotels) – parse-cml: What's Next?]

parse-cml – Flow Chart: START → obtain crawled content → tag-balance the HTML page (HtmlCleaner library) → parse (parse object / parse data / parse text) → filter noisy blocks (link-to-text ratio) → extract clean CONTENT → generate fields (title, url, description, keywords, topKwords, content, lang, NER, MWE, boost, digest, domain, host, site, segment) → Indexer → STOP

[Slide: entity graph showing a Focused Entity among related entities]

parse-cml (contd…) – extracting cleaned content: a web page is segmented into blocks (B1–B10); structural mining and the link-to-text ratio select the qualified blocks B5, B6, B7, whose text is extracted:

B5: Agra, the beautiful city, is situated on the banks of the holy river Yamuna. Agra is the third largest city in the state of Uttar Pradesh in India and a prominent tourist destination …
B6: Apart from the Taj Mahal, Agra has two other world heritage sites: one of them is the Red Fort and the other is Fatehpur Sikri. The Red Fort Agra is also located on the other side of the river Yamuna …
B7: If you are visiting Agra, India for the first time, then tourist offices and authorized travel agencies are good places to gather some basic information about the city …

Initially, documents are given a null score; then the score of each qualified segment is constructed incrementally.

Dr Rajendra Prasath