Battle of the Search Engines!

Sarah Loveland

on 9 December 2012

Transcript of Battle of the Search Engines!

BATTLE of the Search Engines
Sarah Loveland & Mike Pflueger

What is retrieval?
"the techniques of storing and recovering and often disseminating recorded data especially through the use of a computerized system" (http://www.merriam-webster.com)
In other words, it's how you find information on the internet.

Two Theses for IR Systems
  • PageRank
  • HITS

PageRank

Simplified summary of the search process
1. A full document scan determines the subset of pages that match the search query. (This is called the Relevancy Set.)
2. The Relevancy Set is sorted according to PageRank.

The Main Idea
  • Links from important sites should be worth more.
  • Links to your site should be weighted by the number of links on the linking page.

The rank matrix
[Equation not captured in the transcript]
Where P is the matrix with: [entries not captured in the transcript]
If one page Q links to your page P: [formula not captured in the transcript]

PageRank is determined by 3 factors
1. The number of pages linking to your page
2. The PageRank of the pages linking to your page
3. The number of outgoing links from the pages linking to your page
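
The three factors above can be combined into one update rule: each page passes its current rank, split evenly among its outgoing links, to the pages it links to. The Python sketch below is a minimal illustration of that idea, not the presenters' or Google's implementation. The four-page link graph, the page texts, and the function names `pagerank` and `search` are all hypothetical. The sketch assumes every page has at least one outgoing link and leaves out refinements such as a damping factor, which the transcript does not mention.

```python
def pagerank(links, iterations=100):
    """Iteratively rank pages; `links` maps each page to the pages it links to.

    Assumption: every page has at least one outgoing link and every
    link target is itself a key of `links`.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start with equal ranks
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for q, outgoing in links.items():
            # Factor 2 (rank of the linking page) divided by
            # factor 3 (its number of outgoing links).
            share = rank[q] / len(outgoing)
            for p in outgoing:
                # Factor 1: one contribution per page linking to p.
                new_rank[p] += share
        rank = new_rank
    return rank


def search(query, texts, rank):
    """Two-step search: build the relevancy set, then sort it by rank."""
    terms = query.lower().split()
    relevancy_set = [
        page for page, text in texts.items()
        if all(term in text.lower().split() for term in terms)
    ]
    return sorted(relevancy_set, key=lambda page: rank[page], reverse=True)


# Hypothetical link graph and page contents, for illustration only.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "B"],
    "D": ["C"],
}
texts = {
    "A": "search engines rank pages",
    "B": "cooking recipes",
    "C": "search tips",
    "D": "search engines history",
}

ranks = pagerank(links)
results = search("search", texts, ranks)
print(ranks)
print(results)
```

Note that `ranks` is computed once from the link structure alone, before any query is made; `search` only filters and sorts.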
CHALLENGES to creating IR systems

THE WEB
  • enormous document collection: It's huge!
  • no editorial review process: It's uncontrolled!
  • frequent (almost constant) updates: It's exploding!

WEB USERS
  • use very short queries
  • rarely use feedback to revise searches
  • rarely use advanced features
  • only view top 10-20 results
  • try to trick IR systems to get good rankings

LANGUAGE
  • homonyms: "bank"
  • synonyms: "automobile", "car", "vehicle"

PageRank
"The heart of our software is PageRank"
  • Web crawlers index and catalog webpages across the internet.
  • Ranking is assigned before a search query is even made.
  • Your page's ranking is the sum, over all the pages linking to your page, of each linking page's ranking divided by its number of outgoing links.

1. Dr. Feeman's homepage
2. Villanova.edu
3. www.thetimes.co.uk
4. Facebook.com
5. ESPN.com

HITS
The thesis of HITS is that good hubs point to good authorities, and good authorities are pointed to by good hubs.
[Figure: Authority, Hub]

Step 1: Create neighborhood graph, N
  • All documents containing references to query terms are put into N.
  • One method consults an inverted term-document file. A search for terms 10 and 11 would pull all documents listed for these terms into N.
  • The graph is expanded by adding nodes pointing to those in N or nodes that are pointed to by those in N. This allows for other associations, e.g. involving synonyms.
  • Since N can become huge, the maximum number of nodes added is fixed.
[Figure: A sample neighborhood graph containing 5 nodes]

Step 2: Compute authority and hub scores
Page i has authority score x_i and hub score y_i.
Adjacency matrix L is formed using N. For the sample graph,

L = 0 1 1 0 0
    1 0 0 1 0
    0 0 0 1 0
    0 1 1 0 1
    0 1 0 0 0

Initial scores x_i^(0) and y_i^(0) are refined by computing

x_i^(k) = sum of y_j^(k-1) over all pages j that link to page i
y_i^(k) = sum of x_j^(k) over all pages j that page i links to

Using L, equations are rewritten in matrix form:

x^(k) = L^T y^(k-1)
y^(k) = L x^(k)

These are used in the iterative algorithm for computing ultimate authority and hub scores, x and y.
1. Initialize y^(0) = e, where e is a column vector of all ones.
2. Until convergence, do
     x^(k) = L^T y^(k-1)
     y^(k) = L x^(k)
     k = k + 1
     normalize x^(k) and y^(k)

These equations can be simplified by substitution to

x^(k) = L^T L x^(k-1)
y^(k) = L L^T y^(k-1)

which defines the iterative power method for computing the dominant eigenvector for the matrices L^T L and L L^T.
L^T L: authority matrix
L L^T: hub matrix
Computing the authority vector, x, and the hub vector, y, can be viewed as finding the dominant right-hand eigenvectors of L^T L and L L^T, respectively.

STRENGTHS
  • dual rankings
  • makes web IR into small problem
  • cost reduction

WEAKNESSES
  • query dependence
  • susceptibility to spamming
  • topic drift

QUESTIONS?
Why do Google and Bing give different search results? And
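
As a closing illustration, the HITS iteration described in the transcript can be sketched in Python using the 5-node sample adjacency matrix L. This is a minimal sketch under stated assumptions, not the presenters' code. It assumes that L[i][j] = 1 means page i+1 links to page j+1 (the transcript does not state the convention), that scores are normalized to sum to 1, and that a fixed number of iterations stands in for a convergence test. The function name `hits` is hypothetical.

```python
# Sample adjacency matrix from the transcript.
# Assumption: L[i][j] == 1 means page i+1 links to page j+1.
L = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
]


def hits(L, iterations=50):
    """Return (authority, hub) score lists for adjacency matrix L."""
    n = len(L)
    y = [1.0] * n  # hub scores start as a vector of all ones
    for _ in range(iterations):
        # Authority update, x = L^T y: add up the hub scores of the
        # pages that link to page i.
        x = [sum(L[j][i] * y[j] for j in range(n)) for i in range(n)]
        # Hub update, y = L x: add up the authority scores of the
        # pages that page i links to.
        y = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Normalize so the scores stay bounded (sum-to-one is an
        # assumption; other norms work as well).
        x_total, y_total = sum(x), sum(y)
        x = [value / x_total for value in x]
        y = [value / y_total for value in y]
    return x, y


authority, hub = hits(L)
for page, (a, h) in enumerate(zip(authority, hub), start=1):
    print(f"page {page}: authority={a:.3f} hub={h:.3f}")
```

Under the assumed link direction, page 2 ends up with the highest authority score and page 4 with the highest hub score, so each page receives the two separate rankings that the transcript lists as a strength.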