Cyc-Wikipedia Mapping

Vijay Raj

on 6 May 2010

Mapping Ontologies: Cyc <-> Wikipedia

(#$isa #$Cyc ?WHAT)
What is Wikipedia?

Cyc
#$SoftwareAgent
Knowledge base

Individuals
Some "thing"... any "thing"
Can have parts
Cannot have instances
Examples: Mountain Standard Time, Longs Peak, International Space Station, Brad Pitt, United States Army, Alice in Wonderland

Collections
Always the first question: what IS that?
X-ray, Sleepy Lion, Suncor oil, Paige Miles, Assam Rifles, Gilgamesh
Examples: Time Zone, Hiking Trail, Military Organization, Book
Collections give perspective
Enable educated guesses
Instances share features or properties
(#$isa ?IND ?COL) - 1.2 million
(#$genls ?COL1 ?COL2) - 120 thousand
#$genls is transitive and reflexive

Microtheory
Time/space context construct
Any assertion is VALID ONLY IN a Microtheory
Assertion: (#$isa #$Cyc #$SoftwareAgent)
Why is it necessary?
Indexing efficiency by knowledge isolation
Context assumption - BoulderColorado context
Consistency enforcement

Predicates and Functions
#$isa, #$genls are predicates

Predicates relate two or more things

Predicates specify properties of things

Predicate arguments are constrained

Functions create new individuals

Wikipedia
Encyclopedia
Knowledge base?

Categories
The Good

A lot of "is a", Rivers by country, Basketball players, Phone Companies

A lot of "genls", Geography -> Water bodies -> Streams -> Rivers

A lot of predicates, "Yale Alumni", "Grammy Award winners", "1947 Births"
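
The category chain above (Geography -> Water bodies -> Streams -> Rivers) can be read as if it were a #$genls hierarchy, which Cyc treats as transitive and reflexive. The following is a minimal Python sketch of that reading, not Cyc's inference engine: the edge list, `build_index`, `all_genls` and `genls` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical genls-style edges (specialization, generalization),
# copied from the category chain quoted in the text.
GENLS_EDGES = [
    ("Rivers", "Streams"),
    ("Streams", "Water bodies"),
    ("Water bodies", "Geography"),
]


def build_index(edges):
    """Map each collection to the set of its direct generalizations."""
    parents = defaultdict(set)
    for sub, sup in edges:
        parents[sub].add(sup)
    return parents


def all_genls(col, parents):
    """Return every collection that `col` specializes, including itself."""
    seen = {col}  # reflexive: every collection generalizes itself
    stack = [col]
    while stack:
        current = stack.pop()
        for sup in parents.get(current, ()):
            if sup not in seen:  # transitive: keep walking upwards
                seen.add(sup)
                stack.append(sup)
    return seen


def genls(sub, sup, parents):
    """True if `sub` is a (possibly indirect) specialization of `sup`."""
    return sup in all_genls(sub, parents)


PARENTS = build_index(GENLS_EDGES)
```

With this index, `genls("Rivers", "Geography", PARENTS)` holds through two intermediate steps, and `genls("Rivers", "Rivers", PARENTS)` holds by reflexivity. The sketch assumes the category links really are specializations, which is often not true of Wikipedia categories.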
The Bad

Not "is a" always, Rivers category has "Drainage", "River surfing", "Dams". Mostly "is about"..

Categories used as Microtheories, "US Elections in 2010", "History of Spain"

"Genls" not consistent, Wiki -> Ontology, can't be automated

Category's main article vs category, have same name
Article Title
The Bad

Global ID lacking, article titles keep changing.

"The Explosive, The BOB, The Chef", overly general synonyms

Disambiguation is in the title, (band), (politician), vs. a Freebase-style type with depth (/artist/band)

Search space always global, only separated by language
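
Because the disambiguating hint lives inside the title itself, a mapper has to pull it out before comparing names. Here is a minimal sketch, assuming the common "Name (qualifier)" title pattern and underscores standing in for spaces; `split_title` is a hypothetical helper, not part of any Wikipedia API.

```python
import re

# Matches titles of the form "Name (qualifier)", e.g. "Orion (constellation)".
_DISAMBIG = re.compile(r"^(?P<name>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_title(title):
    """Split an article title into (name, qualifier).

    The qualifier is None when the title has no trailing parenthesis.
    """
    title = title.replace("_", " ").strip()
    match = _DISAMBIG.match(title)
    if match:
        return match.group("name"), match.group("qualifier")
    return title, None
```

For example, `split_title("Orion (constellation)")` yields the name "Orion" and the qualifier "constellation", while `split_title("Bird_Flight")` yields "Bird Flight" with no qualifier.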
Article Title
The Good

Mostly equivalent to Individuals, sometimes Collections

Redirects, "article interlinks" are mostly equivalent to synonyms

Mostly about a single topic, event, abstract concept (Bird_Flight)

Each article is an "instance of" a Category

Infoboxes/Tables
The Good

Huge amount of domain-specific info: tournaments won, scientific classification

New predicate harvesting - collective consensus on what's important

The Bad

No re-use/hierarchy (genlPred): Snooker/Tennis players could share a basic bio.

No consistency, Snooker: "Nationality", Tennis: "Country"

Not easy to parse. Too many formats for same field, no array (tributaries)

Tables are even less structured, but hold even more information
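
The inconsistencies listed above (different field names for the same property, and no array type for multi-valued fields such as tributaries) suggest a normalization pass before any predicate harvesting. The following is a minimal sketch under stated assumptions: the alias table, the set of multi-valued fields, and `normalize_infobox` are all hypothetical, and real infobox markup has many more formats than the three separators handled here.

```python
import re

# Hypothetical alias table: maps per-template field names to one canonical name.
FIELD_ALIASES = {
    "nationality": "country",  # Snooker player infobox
    "country": "country",      # Tennis player infobox
}

# Hypothetical set of fields that should hold a list of values.
MULTI_VALUED = {"tributaries"}

# Separators assumed for multi-valued fields: comma, semicolon, or <br>.
_SEPARATOR = re.compile(r"\s*(?:,|;|<br\s*/?>)\s*")


def normalize_infobox(raw):
    """Return a copy of `raw` with canonical keys and split list fields."""
    normalized = {}
    for key, value in raw.items():
        canonical = key.strip().lower().replace(" ", "_")
        canonical = FIELD_ALIASES.get(canonical, canonical)
        value = value.strip()
        if canonical in MULTI_VALUED:
            normalized[canonical] = [p for p in _SEPARATOR.split(value) if p]
        else:
            normalized[canonical] = value
    return normalized
```

Under these assumptions a snooker infobox with "Nationality" and a tennis infobox with "Country" both end up with a `country` key, which is the kind of re-use the infoboxes themselves do not provide.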

Article Text
The Good

Can't say enough about it. Holds millions of man-hours of work. Active community

Tagging with [[ | ]] makes disambiguation easy

Millions of assertions, breadth/depth, ontologist's dream

References make getting redundancy easier

The Bad

It's still geared towards humans

Predicates are not available, not much extra effort to tag

#$Orion-Constellation

Basic Info
What is it?
(#$isa #$O-C #$Constellation)

English name for it?
(#$nameString #$O-C "Orion")
(#$nameString #$O-C "the Hunter")

Advanced Info
(#$celestialSubRegion #$O-C #$OrionsBelt)

(#$inRegion #$Rigel-Star #$O-C)

(#$inRegion #$Betelgeuse-Star #$O-C)

Term Cloud
Orion
Constellation
Celestial Sub Region
Orion's Belt

Three good terms are enough for a match (Google: only "Orion" vs "Orion" + "Constellation"/"Hunter"; "Bank" vs "Bank" + "River")

Mapping
Create search set
Proximity metric

Orion (constellation)
Basic Info
Redirects

The Hunter

Categories
North Constellations
Orion Constellation

Article Info
"Orion, often referred to as The Hunter, is a prominent constellation located"
"A line from Rigel through Betelgeuse points to Castor and Pollux"
"the three stars in Orion's Belt"

Article Title
Wikipedia Redirects
Cyc #$nameString
Article Interlinks [[ Art | Syn ]]

163K Cyc Individuals/Collections
1.7M Wikipedia Articles
80K Useful Cyc Constants

45K Article/Cyc name match
(Sometimes not accurate)

15K Non-trivial matches

Rest: No match

Examples
RescuingSomeone<cycSep>Rescue<cycSep>999

Reference/Research
My Links (expect inaccuracies)

Rigorous Research
http://www.cs.waikato.ac.nz/~olena/publications/Medelyan_Legg_Wikiai08.pdf

Term Cloud weight - number of times any term in the term cloud is found in the article

Term weight vector match - vector of number of times each term is found in article

Wikipedia synonym weight - in the article set, the article with the most interlinks whose synonym equals a Cyc #$nameString

Wiki synonyms article set, Wiki "Cyc context" article set, and "Hyperlink interconnection metric".
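
The three weights described above can be sketched in a few lines of Python. This is one possible reading of the slide notes, not the presenter's implementation: the function names, the case-insensitive substring counting, and the wikilink regex are all assumptions made for illustration.

```python
import re
from collections import Counter

# Matches [[Article]] and [[Article|Synonym]] interlinks in wiki markup.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def term_weight_vector(terms, text):
    """Number of times each term occurs in the article text.

    Counting is a case-insensitive substring count, so "Orion" is also
    counted inside "Orion's Belt".
    """
    lowered = text.lower()
    return [lowered.count(term.lower()) for term in terms]


def term_cloud_weight(terms, text):
    """Total number of term-cloud hits in the article text."""
    return sum(term_weight_vector(terms, text))


def synonym_weight(name_strings, article_texts):
    """Count, per target article, the interlinks whose visible text
    (the synonym) equals one of the Cyc name strings."""
    names = {name.lower() for name in name_strings}
    votes = Counter()
    for text in article_texts:
        for target, synonym in WIKILINK.findall(text):
            label = (synonym or target).strip().lower()
            if label in names:
                votes[target.strip()] += 1
    return votes
```

As a usage example, take the Orion term cloud (Orion, Constellation, Celestial Sub Region, Orion's Belt) and the two article snippets quoted earlier, joined into one string: the term weight vector is [2, 1, 0, 1] and the term cloud weight is 4. For the synonym weight, the candidate with the highest count in the returned `Counter` would be the preferred match.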
[Matching diagram labels: Exact Match; Partial Match; Article Title; Wikipedia Redirects; Cyc Genls/Siblings; Article Title/Redirects]