ZOOMA 2: A repository of annotation knowledge and an API for automatic curation
Tony Burdett, Functional Genomics Production Team, EBI

[Diagram: the ZOOMA data model. Nodes: Property, Annotation, Semantic Tag, Biological Entity, Study, Annotation Provenance. Edge labels: hasBody, hasSemanticTag, hasTarget, annotates, isPartOf, hasProvenance.]

Background
The Functional Genomics Production Team deal with all submissions to ArrayExpress, the Gene Expression Atlas, and the BioSamples database.
There are almost 35,000 experiments in ArrayExpress, and 3,500 of these have been highly curated for eligibility in the Atlas.
We have a team of curators who process around 100 experiments a month for ArrayExpress.
Some of these are eligible for the Atlas, and require deeper curation.

Ontology Mapping and Cleanup
The use of ontologies to annotate data is an established method for adding semantics to metadata, facilitating integration and richer querying.
We map data to the Experimental Factor Ontology (EFO), an application-focused ontology modelling experimental variables
Mapping to ontologies in this way helps to harmonise data across experiments and drive searches and data visualisation
Generating these mappings is manually intensive
Ontologies evolve, and the technologies used to perform experiments are continually improving
This causes an "annotation gap"

Definition:
Property: A pair of strings, forming a typed description of an attribute about some biological entity.

Sample and Assay Description
We request that all submissions to ArrayExpress include some descriptive information about assays and samples.
Essentially, these are typed text strings - e.g. "Organism: Homo sapiens" or "Cell line: HeLa"
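To make the idea of a typed text string concrete, here is a minimal Python sketch that splits a description such as "Organism: Homo sapiens" into a (type, value) pair. The names `Property` and `parse_property` are hypothetical and chosen for illustration; this is not how ArrayExpress or ZOOMA necessarily store properties.

```python
from typing import NamedTuple


class Property(NamedTuple):
    """A typed description: a pair of strings (hypothetical structure)."""
    property_type: str
    property_value: str


def parse_property(text: str) -> Property:
    """Split 'Type: value' at the first colon and trim surrounding whitespace."""
    property_type, separator, property_value = text.partition(":")
    if not separator or not property_type.strip() or not property_value.strip():
        raise ValueError(f"expected 'Type: value', got {text!r}")
    return Property(property_type.strip(), property_value.strip())


if __name__ == "__main__":
    for description in ("Organism: Homo sapiens", "Cell line: HeLa"):
        print(parse_property(description))
```

Splitting only at the first colon means a value that itself contains a colon is kept intact.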
These descriptions give us something to work from in annotating samples and assays against EFO.

Submitted Data
Properties represent the minimal amount of information we need in a submission, but...
they are rarely harmonised
and can be ambiguous.
Submitters often don't tell us the things we actually want to know
We also end up with a few properties that are very heavily reused, and a great many properties used only once or twice.
Examples: "Sex: male" and "Gender: male"

Definition:
Semantic Tag: An identifiable concept, with an assigned URI, that formally defines the semantics of the concept. Normally this is an ontology term.

Semantic Tags
In ArrayExpress, we almost always annotate to EFO (although many EFO terms are imported from other ontologies).

Definition:
Annotation: A single link, or mapping, between a property and a semantic tag that can be asserted in a given context.

The Annotation Process
Linking properties and ontology terms together is an expensive, highly manual process
Typically this happens either by editing MAGE-TAB files, scripting, SQL updates or using curation tools.
Curators spend a lot of time annotating the same types of data, or making the same sorts of fixes, over and over

ZOOMA
By creating a repository of annotations, and scoring their quality, we have been able to build a "smart" annotation search service.
The data we used has been curated by hand, so:
it is a very rich source of knowledge
performs better than text matching
can capture more obscure types of annotations (for example, involving compound properties).

Definition:
Biological Entity: Any physical entity that is part of an experiment and has a series of attributes. In our case, this is a sample, an assay or a SNP.

Definition:
Study: An experiment that generated some data that needs to be described.

Benefits
We can drop the ZOOMA autocomplete widgets into our submission tools.
This should encourage better, more consistent submissions across all our datasources, with less curator intervention required.

Some of the variability in annotations we see is covered in the ontologies we use by class names and synonyms.
By capturing a measure of the variability, including things that are not synonymous, we provide a resource for enriching EFO.
In addition, we record how frequently ontology classes are annotated against, highlighting the important concepts in the domain

ZOOMA
ZOOMA is a linked data repository of annotation knowledge and highly annotated data.
It has been seeded with manually curated data about biological concepts from the ArrayExpress and Atlas databases, as well as the NHGRI GWAS catalog and EFO.
ZOOMA provides a service that allows querying by plain text and returns possible annotations between matching properties and concepts identified by a URI.
This makes it possible to exploit these high-quality annotations to enrich other datasources through reliable automatic annotation.

Example properties:
Disease State: Cancer
Genotype: arm
Organism: Homo sapiens
Organism: Mus musculus
Cell Type: Embryonic stem cells
Cell Line: tt2-g9aflox/delta+oht
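As a rough sketch of the plain-text querying described above, the following Python builds a search path in the same shape as the one shown elsewhere in this deck (`/zooma/v2/api/search?query=small+cell+lung+cancer`). The function name `build_search_path` is hypothetical, and the host to prepend depends on the deployment, so it is left out here. The format of the response is not described in these slides and is not assumed.

```python
from urllib.parse import urlencode

# Search path as it appears in this deck; the host is deployment-specific.
SEARCH_PATH = "/zooma/v2/api/search"


def build_search_path(query: str) -> str:
    """Return the search path with the plain-text query URL-encoded."""
    return f"{SEARCH_PATH}?{urlencode({'query': query})}"


if __name__ == "__main__":
    for text in ("small cell lung cancer", "Embryonic stem cells"):
        print(build_search_path(text))
```

`urlencode` encodes spaces as `+`, which matches the example path in the deck.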
The ZOOMA data model is built using the Open Annotation Model proposed by the Open Annotation Community Group (http://www.w3.org/community/openannotation)

ZOOMA stores:
The source of this annotation (e.g. the database)
The "creator" of this annotation (a person or a script)
The date the annotation was created
An evidence code describing how this annotation was made
We can use this information in scoring algorithms during annotation searching, and in data mining

Definition:
Annotation Provenance: A record of provenance information relating to how and when a known annotation was created or modified

Integration
Linking samples and assays to ontology terms allows us to do much richer searching and visualisation.
It also provides a means to integrate across experiments.

[Diagram labels: ZOOMA jQuery plugin, ZOOMA REST-like API, ZOOMA User Interface, ZOOMA SPARQL Querying]

"Biological Entity"
We don't care what these things are or what they look like: ZOOMA isn't trying to integrate data on the level of biological entities.
ZOOMA contains this concept simply as a means to group annotations together in some context.
This allows for semantic similarity searching.
We can also cross-link between ZOOMA annotations and other resources, for example the BioSamples database.

Studies
Again, we're not trying to model studies in ZOOMA, but they exist to provide an additional (more abstract) level of association and grouping between annotations.
So each annotation can be made in the context of a biological entity, and biological entities exist in the context of studies.
Usually this will mean that annotations are consistent within studies (as they were probably asserted at the same time).
We can also link studies to the source databases, e.g. ArrayExpress

ZOOMA Services
We've built a ZOOMA annotation repository and search service
RDF graph with a schema based on the Open Annotation Model
Results are filtered, scored and ranked
User interface and a REST-like API
Supports querying using Freebase Suggest
Enables automated annotation as well as the ability to spot inconsistencies and errors.

/zooma/v2/api/search?query=small+cell+lung+cancer
/zooma/v2/api/summaries/FE76A21448E5B38BDB846D044556AF34F8AA72BB
/zooma/v2/api/annotations/ANNO_00302281

Benefits
Automatic curation!
We can see the common annotations that have been applied before and which of these are most trusted, automatically reapply them, and find weird corner cases or odd descriptions.

Benefits
By tracking the context of annotations across all resources, we can do semantic distance queries across samples.
For example, we want to do cross-datasource integration in the BioSamples database.
We want to infer possible relations between samples based on shared sets of annotations

Benefits to You
We expect many groups have exactly the same "annotation gap" problem that we do.
Hopefully, our "curator knowledge" resource will be useful to the community in remapping data.
We'd be really interested to hear from people who would like to consume this resource
...and annotation providers!

[Chart: the 200 most frequently used properties in ArrayExpress. Labels shown: organism: homo sapiens; sex: female; gender: female; diseasestate: breast cancer; frequency = 2285.]
There are nearly 250,000 in total, and about half of these are only ever used once.

[Chart: the 200 most frequently used properties in Atlas. Labels shown: frequency = 1288; celltype: peripheral blood mononuclear cell; organism: homo sapiens; celltype: hepatocyte; sex: female; organism_part: lung.]
There are 25,000 unique properties in the Atlas, an order of magnitude fewer than in ArrayExpress

Acknowledgement
Everyone in the EBI FG Production, Development and Atlas Teams
Helen Parkinson
Simon Jupp
Dani Welter
Funding: EMBL-EBI, NCBO (U54-HG004028)

Thank you! Questions?
ZOOMA live demo: http://wwwdev.ebi.ac.uk/fgpt/zooma
ZOOMA blog post by James Malone: http://goo.gl/Dktsn

Benefits to us
We already use ZOOMA
in our Atlas data release process
as part of the GWAS catalog curation
And we hope to use ZOOMA soon
as part of the Ensembl Variation release pipeline
as part of the Biosamples loading pipeline
at the point of submission to ArrayExpress
to automatically clean up older datasets in ArrayExpress

Example semantic query:
"Show me all biological entities that are from human lung in patients with cancer"

Example annotation result:
Compound: Valproic acid 0.8 millimolar
CHEBI:39867 Valproic acid
score 36.9
VPA24h_8 extract
E-TABM-903
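
The definitions in this deck (property, semantic tag, annotation, biological entity, study, annotation provenance) can be sketched as a handful of Python dataclasses. This is an illustrative reading only: the class and field names are hypothetical, ZOOMA itself stores an RDF graph based on the Open Annotation Model, and the way the Valproic acid example is split into fields below is an assumption about the flattened slide text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Property:
    property_type: str
    property_value: str


@dataclass(frozen=True)
class SemanticTag:
    identifier: str  # compact ID as shown on the slide; a full URI in practice
    label: str


@dataclass(frozen=True)
class Study:
    accession: str


@dataclass(frozen=True)
class BiologicalEntity:
    name: str
    study: Study  # biological entities exist in the context of studies


@dataclass(frozen=True)
class Provenance:
    source: str    # e.g. the database
    creator: str   # a person or a script
    created: date
    evidence: str  # evidence code describing how the annotation was made


@dataclass(frozen=True)
class Annotation:
    prop: Property
    semantic_tag: SemanticTag
    target: BiologicalEntity
    provenance: Optional[Provenance] = None  # not given in the slide example


# Assumed reading of the Valproic acid example from this deck.
example = Annotation(
    prop=Property("Compound", "Valproic acid 0.8 millimolar"),
    semantic_tag=SemanticTag("CHEBI:39867", "Valproic acid"),
    target=BiologicalEntity("VPA24h_8 extract", Study("E-TABM-903")),
)

if __name__ == "__main__":
    print(example.semantic_tag.identifier, "annotates", example.target.name)
```

Provenance is left optional because the slide example gives no source, creator, date or evidence code, and none are invented here.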