Zooma 2 ~ ISMB

A repository of annotation knowledge and an API for automatic curation
by Tony Burdett on 9 April 2013

ZOOMA 2
A repository of annotation knowledge and an API for automatic curation
Tony Burdett, Functional Genomics Production Team, EBI

[Model diagram. Concepts: Property, Annotation, Semantic Tag, Biological Entity, Study, Annotation Provenance. Relations: hasBody, hasSemanticTag, hasTarget, annotates, isPartOf, hasProvenance.]

Background

The Functional Genomics Production Team deal with all submissions to ArrayExpress, the Gene Expression Atlas, and the BioSamples database. There are almost 35,000 experiments in ArrayExpress, and 3,500 of these have been highly curated for eligibility in the Atlas. We have a team of curators who process around 100 experiments a month for ArrayExpress. Some of these are eligible for the Atlas, and require deeper curation.

Ontology Mapping and Cleanup

The use of ontologies to annotate data is an established method for adding semantics to metadata, facilitating integration and richer querying. We map data to the Experimental Factor Ontology (EFO), an application-focused ontology modelling experimental variables. Mapping to ontologies in this way helps to harmonise data across experiments and drive searches and data visualisation.

Generating these mappings is manually intensive. Ontologies evolve, and the technologies used to perform experiments are continually improving. This causes an "annotation gap".

Definition (Property): A pair of strings, forming a typed description of an attribute about some biological entity.

Sample and Assay Description

We request that all submissions to ArrayExpress include some descriptive information about assays and samples. Essentially, these are typed text strings, e.g. "Organism: Homo sapiens" or "Cell line: HeLa". This gives us something to work from in annotating samples and assays against EFO.

Submitted Data

Properties represent the minimal amount of information we need in a submission, but they are rarely harmonised and can be ambiguous.
Submitters often don't tell us the things we actually want to know. We also end up with a few properties that are very heavily reused, and a great many properties used only once or twice. For example:

Sex: male
Sex: m
Gender: male

Definition (Semantic Tag): An identifiable concept, with an assigned URI, that formally defines the semantics of the concept. Normally this is an ontology term.

Semantic Tags

In ArrayExpress, we almost always annotate to EFO (although many EFO terms are imported from other ontologies).

Definition (Annotation): A single link, or mapping, between a property and a semantic tag that can be asserted in a given context.

The Annotation Process

Linking properties and ontology terms together is an expensive, highly manual process. Typically this happens by editing MAGE-TAB files, scripting, SQL updates or using curation tools. Curators spend a lot of time annotating the same types of data, or making the same sorts of fixes, over and over.

ZOOMA

By creating a repository of annotations, and scoring their quality, we have been able to build a "smart" annotation search service. The data we used has been curated by hand, so it:

  • is a very rich source of knowledge
  • performs better than text matching
  • can capture more obscure types of annotations (for example, involving compound properties).

Definition (Biological Entity): Any physical entity that is part of an experiment and has a series of attributes. In our case, this is a sample, an assay or a SNP.

Definition (Study): An experiment that generated some data that needs to be described.

Benefits

We can drop the ZOOMA autocomplete widgets into our submission tools. This should encourage better, more consistent submissions across all our datasources, with less curator intervention required.

Some of the variability in annotations we see is covered in the ontologies we use by class names and synonyms. By capturing a measure of the variability, including things that are not synonymous, we provide a resource for enriching EFO.
In addition, we record how frequently ontology classes are annotated against, highlighting the important concepts in the domain.

ZOOMA

ZOOMA is a linked data repository of annotation knowledge and highly annotated data. It has been seeded with manually curated data about biological concepts from the ArrayExpress and Atlas databases, as well as the NHGRI GWAS catalog and EFO.

ZOOMA provides a service that allows querying by plain text and returns possible annotations between matching properties and concepts identified by a URI. This makes it possible to exploit these high quality annotations to enrich other datasources through reliable automatic annotation.

Disease State: Cancer
Genotype: arm

Organism: Homo sapiens x 492,000
Organism: Mus musculus x 149,000
Sex: Male x 76,000
Cell Type: Embryonic stem cells x 1762
Cell Line: tt2-g9aflox/delta+oht x 16
BMI: 41.6261 x 1

This model is built using the Open Annotation Model proposed by the Open Annotation Community Group (http://www.w3.org/community/openannotation).

ZOOMA stores:

  • The source of this annotation (e.g. the database)
  • The "creator" of this annotation (a person or a script)
  • The date the annotation was created
  • An evidence code describing how this annotation was made

We can use this information in scoring algorithms during annotation searching, and in data mining.

Definition (Annotation Provenance): A record of provenance information relating to how and when a known annotation was created or modified.

Integration

Linking samples and assays to ontology terms allows us to do much richer searching and visualisation. It also provides a means to integrate across experiments.

[Diagram labels: ZOOMA jQuery plugin, ZOOMA REST-like API, ZOOMA User Interface, ZOOMA SPARQL Querying.]

"Biological Entity"

We don't care what these things are or what they look like: ZOOMA isn't trying to integrate data on the level of biological entities. ZOOMA contains this concept simply as a means to group annotations together in some context.
This allows for semantic similarity searching. We can also cross-link between ZOOMA annotations and other resources, for example the BioSamples database.

Studies

Again, we're not trying to model studies in ZOOMA, but they exist to provide an additional (more abstract) level of association and grouping between annotations. So each annotation can be made in the context of a biological entity, and biological entities exist in the context of studies. Usually this will mean that annotations are consistent within studies (as they were probably asserted at the same time). We can also link studies to the source databases, e.g. ArrayExpress.

ZOOMA Services

We've built a ZOOMA annotation repository and search service:

  • RDF graph with a schema based on the Open Annotation Model
  • Results are filtered, scored and ranked
  • User interface and a REST-like API
  • Supports querying using Freebase Suggest

This enables automated annotation as well as the ability to spot inconsistencies and errors.

/zooma/v2/api/search?query=small+cell+lung+cancer
/zooma/v2/api/summaries/FE76A21448E5B38BDB846D044556AF34F8AA72BB
/zooma/v2/api/annotations/ANNO_00302281

Benefits

Automatic curation! We can see the common annotations that have been applied before, which of these are most trusted, and find weird corner cases or odd descriptions and automatically reapply them.

By tracking the context of annotations across all resources, we can do semantic distance queries across samples. For example, we want to do cross-datasource integration in the BioSamples database. We want to infer possible relations between samples based on shared sets of annotations.

Benefits to You

We expect many groups have exactly the same "annotation gap" problem that we do. Hopefully, our "curator knowledge" resource will be useful to the community in remapping data. We'd be really interested to hear from people who would like to consume this resource, and from annotation providers!
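The three example paths listed under ZOOMA Services (search, summaries, annotations) can be assembled programmatically. The following is a minimal sketch only: the host is a placeholder because the slides show just the paths, and the function names are hypothetical rather than part of any ZOOMA client. It builds URLs and makes no network calls.

```python
from urllib.parse import quote, urlencode, urljoin

# Placeholder host (an assumption): the slides give only paths under /zooma/v2/api/.
ASSUMED_HOST = "http://example.org"
API_ROOT = "/zooma/v2/api/"


def search_url(query: str, host: str = ASSUMED_HOST) -> str:
    """Build a plain-text search URL in the form shown on the slide."""
    # urlencode encodes spaces as '+', matching the slide's example query.
    return urljoin(host, API_ROOT + "search") + "?" + urlencode({"query": query})


def summary_url(summary_id: str, host: str = ASSUMED_HOST) -> str:
    """Build the URL for one annotation summary, identified by its ID."""
    return urljoin(host, API_ROOT + "summaries/" + quote(summary_id, safe=""))


def annotation_url(annotation_id: str, host: str = ASSUMED_HOST) -> str:
    """Build the URL for a single annotation, e.g. ANNO_00302281."""
    return urljoin(host, API_ROOT + "annotations/" + quote(annotation_id, safe=""))


if __name__ == "__main__":
    print(search_url("small cell lung cancer"))
    print(annotation_url("ANNO_00302281"))
```

Keeping the host as a parameter means the same helpers would work against whichever server actually hosts the API.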
[Chart: the 200 most frequently used properties in ArrayExpress. Labels shown: organism: homo sapiens; sex: female; gender: female; diseasestate: breast cancer; frequency = 2285.]

This chart shows the 200 most frequently used properties in ArrayExpress. There are nearly 250,000 in total, and about half of these are only ever used once.

[Chart: the 200 most frequently used properties in Atlas. Labels shown: frequency = 1288; celltype: peripheral blood mononuclear cell; organism: homo sapiens; celltype: hepatocyte; sex: female; organism_part: lung.]

This chart shows the 200 most frequently used properties in Atlas. There are 25,000 unique properties in the Atlas, an order of magnitude fewer than in ArrayExpress.

Acknowledgement

Everyone in the EBI FG Production, Development and Atlas Teams. Especially: Eleanor Williams, Maria Keays, Robert Petryszak, Adam Faulconbridge, Helen Parkinson, Simon Jupp, James Malone, Dani Welter.

Funding: EMBL-EBI, NCBO (U54-HG004028)

Thank you! Questions?

ZOOMA live demo: http://wwwdev.ebi.ac.uk/fgpt/zooma
ZOOMA blog post by James Malone: http://goo.gl/Dktsn

"annotation gap"; linked data repository of annotation knowledge; automatic annotation.

Benefits to us

We already use ZOOMA:

  • in our Atlas data release process
  • as part of the GWAS catalog curation

And we hope to use ZOOMA soon:

  • as part of the Ensembl Variation release pipeline
  • as part of the Biosamples loading pipeline
  • at the point of submission to ArrayExpress
  • to automatically clean up older datasets in ArrayExpress

"Show me all biological entities that are from human lung in patients with cancer"

Compound: Valproic acid 0.8 millimolar
CHEBI:39867
Valproic acid maps to CHEBI:39867 8 times, score 36.9
VPA24h_8 extract
E-TABM-903
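The definitions in the talk (property, semantic tag, biological entity, study, annotation, provenance) can be pictured as a small set of record types. This is an illustrative sketch and not ZOOMA's actual schema, which is an RDF graph based on the Open Annotation Model. All class and field names are hypothetical, and reading "VPA24h_8 extract" as the biological entity and "E-TABM-903" as the study is an assumption drawn from the order of the example above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Property:
    """A typed text string, e.g. type "Compound" with a free-text value."""
    property_type: str
    property_value: str


@dataclass(frozen=True)
class SemanticTag:
    """A concept with an assigned URI; stored here exactly as the slide shows it."""
    uri: str


@dataclass(frozen=True)
class Study:
    accession: str


@dataclass(frozen=True)
class BiologicalEntity:
    """A sample, assay or SNP; it exists only to group annotations within a study."""
    name: str
    study: Study


@dataclass(frozen=True)
class Provenance:
    """The four provenance items the talk says ZOOMA stores."""
    source: str
    creator: str
    created: date
    evidence: str


@dataclass(frozen=True)
class Annotation:
    """One link between a property and a semantic tag, asserted in some context."""
    prop: Property
    tag: SemanticTag
    target: BiologicalEntity
    provenance: Optional[Provenance] = None  # not given in the slide example


def describe(a: Annotation) -> str:
    """Render an annotation together with its context on one line."""
    return (f"{a.prop.property_type}: {a.prop.property_value} -> {a.tag.uri} "
            f"({a.target.name}, {a.target.study.accession})")


# The Valproic acid example from the slide, with provenance left unset.
example = Annotation(
    prop=Property("Compound", "Valproic acid 0.8 millimolar"),
    tag=SemanticTag("CHEBI:39867"),
    target=BiologicalEntity("VPA24h_8 extract", Study("E-TABM-903")),
)

if __name__ == "__main__":
    print(describe(example))
```

Making the records immutable reflects the idea that each annotation is a single asserted link; a changed mapping would be a new annotation with its own provenance.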