Automatic Construction of Large Scale Knowledge Bases from Textual Data

Semantic Search Enabling

Vasudeva Varma

on 10 October 2012

Transcript of Automatic Construction of Large Scale Knowledge Bases from Textual Data

Semantics in Computing's Future

Vasudeva Varma
www.iiit.ac.in/~vasu

Where can they help?

Automatic Construction of Large Scale Knowledge Bases from Textual Data

Outline of the Talk:

Semantic Search - The Promise, Possibilities and Problems
Semantics - Past, Present and Future
Turning the Web into a Knowledge Base
Improving Knowledge Bases - Structured Embeddings, Relation Adaptation
Search and Related Applications
Pattern Extraction and Mining
Situational Awareness
Question Answering

Semantics: A New Promise!

We Believe in Semantics For:
Intelligent Processing and Reasoning
Knowledge-Enabled Computing
Abstractions and Human Experience

What is possible?
Know everything Wikipedia knows
Know everything that is machine readable?
Collection of all entities, classes, relations and facts

Search is not the KING

Turn the Web into a Knowledge Base

Harvesting Facts (Image courtesy: Gerhard Weikum, Max Planck Institute for Informatics)

Entity recognition & disambiguation
Understanding natural language & speech
Knowledge services & reasoning for semantic apps, deep Q&A
Semantic search: precise answers to advanced queries

Questions:
Who was the President of the US when Barack Obama was born?
Politicians who are also scientists?
Relationship between Rajanikant, Ramayan and Cloud Computing?

State of the Art

But, Now...
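Compositional questions like the first one above amount to joining independent facts stored in a knowledge base. A minimal sketch of such a fact join, with invented triples at (year, month) granularity standing in for what a real KB would hold:

```python
# Toy fact store of (subject, predicate, object) triples. All values are
# illustrative stand-ins for a real KB; dates are (year, month) tuples so
# that birth dates and terms of office compare directly.

facts = [
    ("Barack Obama", "bornOn", (1961, 8)),
    ("Dwight D. Eisenhower", "presidentOfUS", ((1953, 1), (1961, 1))),
    ("John F. Kennedy", "presidentOfUS", ((1961, 1), (1963, 11))),
]

def president_when_born(person):
    """Join two relations: the person's birth date against presidential terms."""
    birth = next(o for s, p, o in facts if s == person and p == "bornOn")
    return [s for s, p, o in facts
            if p == "presidentOfUS" and o[0] <= birth <= o[1]]

print(president_when_born("Barack Obama"))  # → ['John F. Kennedy']
```

The point is not the lookup itself but that neither fact alone answers the question; the KB makes the join possible.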
GBY! delivered minimal Ontologies + Metadata + domain-specific KBs
Concept/object base of facts in one domain at a time at web scale (Freebase, MusicBrainz)
Schema.org (conceptual models for common domains)
Microdata and RDFa - Embedding semantic knowledge in HTML

=> Contributing to a growing and synergetic semantic web ecosystem

High-scale scientific Computing

Agreement between Physical-Social-Cyber Worlds

Search Applications
Enable Auto-faceted search by mining semantic metadata
Extract/Integrate/Re-purpose high-quality datasets to populate ontologies with facts/background information

Limitations
"Perception" on Scalability (DMOZ, Y! Directories)
Extracting semantics in a bottom-up manner vs. scaling of relevant ontologies, models, background information
Connecting silos of big data
Structured and unstructured data silos

DIPRE: Dual Iterative Pattern Relation Extraction

300 ontologies in BioPortal for Life Sciences
Web of Data - 25 billion triples (and growing every year)
5B+ triples in BioRDF

Semantics come from agreement - role in integration of data across platforms and syntactic systems
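The DIPRE idea mentioned above — bootstrap from a few seed pairs, induce surface patterns, then harvest new pairs — can be sketched in a few lines. The corpus and seeds here are toy illustrations; a real run would iterate, feeding harvested pairs back in as fresh seeds:

```python
import re

# DIPRE-style bootstrapping sketch (Dual Iterative Pattern Relation
# Extraction): start from seed (title, author) pairs, induce textual
# patterns from their co-occurrences, then apply those patterns to
# harvest new pairs from the same corpus.

corpus = [
    "Books: 'Hamlet' by William Shakespeare is a classic.",
    "Books: 'Ulysses' by James Joyce is a classic.",
    "Books: 'Dubliners' by James Joyce is a classic.",
]
seeds = {("Hamlet", "William Shakespeare")}

def induce_patterns(corpus, pairs):
    """Collect (left-context, middle-context) patterns around seed pairs."""
    patterns = set()
    for text in corpus:
        for title, author in pairs:
            i, j = text.find(title), text.find(author)
            if i != -1 and j != -1 and i < j:
                patterns.add((text[:i][-8:], text[i + len(title):j]))
    return patterns

def apply_patterns(corpus, patterns):
    """Match each pattern against the corpus to extract candidate pairs."""
    pairs = set()
    for text in corpus:
        for prefix, middle in patterns:
            rx = (re.escape(prefix) + r"(.+?)" + re.escape(middle)
                  + r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)")
            for m in re.finditer(rx, text):
                pairs.add((m.group(1), m.group(2)))
    return pairs

harvested = apply_patterns(corpus, induce_patterns(corpus, seeds))
print(sorted(harvested))
```

From one seed pair, the induced "'X' by Y" pattern recovers the other two (title, author) pairs — the dual pattern/relation loop that gives DIPRE its name.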
Critical in integrating objects that straddle the virtual-physical (or physical-social-cyber) divide

Inferencing

Knowledge Base Population

Inconsistency
Accuracy of facts
Novel information
Cost of manual efforts

Solution: automatically updating information about the entities in knowledge bases.

KBP is broken down into two sub-problems:

Entity Linking: linking entity mentions in documents to Knowledge Base nodes

Slot Filling: extracting attribute information for query entities

Structured Embeddings

Discover information about named entities and incorporate this information in a knowledge source.
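The Entity Linking step above — resolving an ambiguous mention to the right KB node — can be sketched as name matching plus context overlap. The KB entries and scores here are invented for illustration; real linkers use much richer features:

```python
# Illustrative entity-linking sketch: link a textual mention to a KB node
# by exact name match plus context-word overlap. Both KB entries share
# the surface name, so the document context must disambiguate.

kb = {
    "E1": {"name": "Michael Jordan", "context": {"basketball", "bulls", "nba"}},
    "E2": {"name": "Michael Jordan", "context": {"machine", "learning", "berkeley"}},
}

def link(mention, doc_words):
    """Return the KB id whose name matches and whose context best overlaps."""
    best, best_score = None, -1
    for node_id, entry in kb.items():
        if entry["name"].lower() != mention.lower():
            continue
        score = len(entry["context"] & set(doc_words))
        if score > best_score:
            best, best_score = node_id, score
    return best

print(link("Michael Jordan", ["professor", "machine", "learning", "models"]))
# → E2 (the researcher, not the athlete), given these toy contexts
```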

TAC's contest is to evaluate the ability of automated systems in such tasks.

Multi-Task Learning: Relation Adaptation

Scalable knowledge harvesting with high precision and high recall [Nakashole et al., 2011, WSDM]
N-gram itemsets for richer patterns as an extension to PROSPERA
Use MaxSat-based constraint reasoning for validating facts/quality of patterns
Use pattern occurrence statistics to:
Prune the hypothesis space
Derive information weights of clauses for the reasoner

Pattern-based systems (good for recall)
Reasoning-enhanced systems (good for precision)
Constraint/consistency-aware systems (scalability/efficiency issues)

Structured Embeddings of Knowledge Bases

Each has a different rigorous symbolic framework (hard to use one system's data in another).
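The pattern-statistics step described above — pruning the hypothesis space and deriving clause weights from occurrence counts — can be sketched as follows. The observations and threshold are invented; in PROSPERA-style systems the surviving weights would feed a MaxSat reasoner:

```python
from collections import Counter

# Sketch of pruning with pattern occurrence statistics: keep only
# patterns whose precision among seed-confirmed facts clears a
# threshold; the surviving ratios can serve as clause weights for a
# downstream constraint reasoner.

observations = [
    ("<X> was born in <Y>", True), ("<X> was born in <Y>", True),
    ("<X> was born in <Y>", True), ("<X>, <Y>", True),
    ("<X>, <Y>", False), ("<X>, <Y>", False),
    ("<X> visited <Y>", False),
]

support = Counter(p for p, correct in observations if correct)
total = Counter(p for p, _ in observations)

def confident_patterns(min_precision=0.6):
    """Keep patterns whose observed precision meets the threshold."""
    return {p: support[p] / total[p] for p in total
            if support[p] / total[p] >= min_precision}

print(confident_patterns())  # → {'<X> was born in <Y>': 1.0}
```

The ambiguous comma pattern and the unrelated "visited" pattern are pruned, shrinking the hypothesis space before any expensive reasoning runs.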

A learning process that embeds symbolic representations into a more flexible continuous vector space in which the original knowledge is kept and enhanced.

Each entity of a knowledge base (such as WordNet) is encoded into a low-dimensional embedding vector space preserving the original data structure.
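The scoring idea behind such structured embeddings can be sketched with tiny hand-picked vectors: each entity gets a low-dimensional vector, each relation a pair of matrices, and a triple is plausible when the projected head lands near the projected tail. The entities, dimensions, and identity relation matrices below are all illustrative assumptions:

```python
import numpy as np

# Structured-embedding sketch: score a triple (h, r, t) by the distance
# between R_lhs @ e_h and R_rhs @ e_t. Here the hypernym relation uses
# identity matrices, so the score reduces to plain vector distance.

dim = 2
entities = {"cat": np.array([1.0, 0.0]),
            "feline": np.array([1.0, 0.1]),
            "car": np.array([0.0, 1.0])}
R_lhs = np.eye(dim)
R_rhs = np.eye(dim)

def score(h, t):
    """Lower score = more plausible (h, hypernym, t)."""
    return float(np.linalg.norm(R_lhs @ entities[h] - R_rhs @ entities[t]))

# Ranking candidate hypernyms for "cat": "feline" should score best.
ranked = sorted(["feline", "car"], key=lambda t: score("cat", t))
print(ranked[0])  # → feline
```

Training would adjust the vectors and matrices so that observed KB triples score lower than corrupted ones; the sketch shows only the geometry being learned.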

These learnt embeddings would allow data from any KB (e.g. Freebase and WordNet) to be easily used in machine learning methods for prediction and information retrieval.

We use WordNet, ConceptNet and raw text as knowledge sources
Meaning Representations (MRs) induced from raw text and MRs of WordNet are embedded in the same space
Allows us to learn to perform disambiguation on raw text
MR prediction can also be seen as knowledge extraction
Extracted MRs from raw text potentially enrich WordNet

Problem: accurately detecting the semantic relations between two entities

State of the art: supervised learning algorithms (requiring an adequate amount of labelled data for each of the target relation types)

Idea: adapt an existing relation extraction system to new relation types using a small set of training instances (weakly supervised setting)

SemantiFire

Automatically generates an ontology from documents.
Generated ontology is W3C OWL/RDF compliant.
SPARQL engines can be used to query the ontology.
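The kind of query a SPARQL engine answers over the generated OWL/RDF ontology boils down to matching triple patterns with variables. A pure-Python stand-in (a real deployment would use an RDF library and actual SPARQL; the triples below are invented):

```python
# Minimal triple-pattern matcher illustrating what a SPARQL engine does
# over an RDF graph: None plays the role of a SPARQL variable (?x).

triples = [
    ("doc1", "hasAuthor", "Alice"),
    ("doc2", "hasAuthor", "Bob"),
    ("doc1", "hasTopic", "ontologies"),
]

def match(pattern):
    """Return all triples consistent with a (s, p, o) pattern."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Analogue of: SELECT ?doc WHERE { ?doc :hasAuthor "Alice" }
print([s for s, _, _ in match((None, "hasAuthor", "Alice"))])  # → ['doc1']
```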
Alternatively, SemantiFire can provide intelligent keyword-based searches.
High-speed, scalable product.

Pain Points

High-speed data processing and organization
Data integration/interoperability for unstructured and semi-structured content.
Semantic Processing of queries while searching

SemantiFire is a tool to quickly semantify an organization's content using a semi-automatic process.
Generates OWL-RDF ontology.
Ontology creation and annotation as a continuous process.
Can be tied with CMS to auto-tag content with upper ontologies.
Exploits the semi-structure of documents + statistics + general models mined from the web.

Example queries:
"Get the resumes of all candidates who worked for fortune 500 companies"

"show me all emails related to the sales of product X" Express the relation R(A,B) as a feature representation

Use lexical and syntactic features/patterns

Find Relation Independent (RI) and Relation Specific (RS) patterns and construct a bipartite graph G = (V_RS ∪ V_RI, E). Each edge e_ij ∈ E is associated with a non-negative weight m_ij of the edge weight matrix M of the bipartite graph G.

Given M, we compute a projection matrix U with a lower dimension in the latent space.
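One standard way to obtain such a low-dimensional projection from an edge-weight matrix is a truncated SVD; whether the original system used SVD or another factorization is not stated here, so treat this as one plausible sketch with an invented matrix:

```python
import numpy as np

# Sketch: M is the bipartite edge-weight matrix (rows = relation-specific
# patterns, columns = relation-independent patterns). A truncated SVD
# maps each RS pattern into a k-dimensional latent space, where patterns
# that co-occur with similar RI patterns land close together.

M = np.array([[3.0, 1.0, 0.0],
              [2.0, 1.0, 0.0],
              [0.0, 0.0, 4.0]])

k = 2                                  # latent dimensionality
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_k = U[:, :k] * s[:k]                 # rows of M projected into k dims

# Patterns 0 and 1 share their RI-pattern profile, so their latent
# vectors should be far closer to each other than to pattern 2.
d01 = np.linalg.norm(U_k[0] - U_k[1])
d02 = np.linalg.norm(U_k[0] - U_k[2])
print(d01 < d02)  # → True
```

It is this closeness in the latent space that lets labeled entity pairs from a source relation inform a classifier for a target relation despite their surface patterns differing.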

The low-dimensional projection reduces the mismatch between patterns in source and target relation types, enabling us to train a classifier for the target relation type using labeled entity pairs for both source and target relation types.

End

Semantics is a tough area
The field promised more and delivered less - A failed promise?
An overused and abused term causing a lot of confusion
Yet to be proven

Big Data Computing

What do we do?