SIB Training on EBI RDF Platform 1/12/2015

by Marco Brandizi on 23 February 2016

Transcript of SIB Training on EBI RDF Platform 1/12/2015

Why Linked Data (aka Semantic Web) for Life Science
Huge amount of data/information (sequencing, microarrays, clinical data, pathways, protocols, you name it)
and counting (NGS, personalised medicine, IoT & Healthcare)
Heterogeneity, which needs to be integrated (from protein structures to population genetics, e.g. in immunology or neurology, comparing animal models, microbiology, ecology)
Makes cooperation & sharing a compelling need
Different/conflicting terminology (what is a gene? How is it identified? How to model an abnormal organ?)
The World Wide Web is a powerful means of sharing/communication for humans
Semantic Web for open data and machines
Why Linked Data @ EBI?
Why not?! :-)
Many were already doing it in 2013, including internal pioneers
Interest from communities, e.g., Industry Program, pharma
Feeling that technology is maturing, community is growing
Introduction:
The EBI and the RDF Platform
EBI RDF Platform, Rationale and Implementation
Initially address technical users, others in future
RDF/SPARQL/etc is not yet well-known, but within reach
Reflect our organisation: multiple, relatively independent datasets
Harmonise/coordinate it
An RDF 'meta-team'
Common policies/practices:
URI patterns and resolution
Sharing ontologies
and dataset schemas ('application ontologies')
Sharing infrastructure, eg, common configuration/scripts/interface
Evaluate (in Nov 2015)
EBI RDF Platform, Rationale and Implementation
Courtesy of: J. McMurry, Biomedbridges
Gene Expression Atlas
Example: Which experiments contain sample descriptions that mention diabetes?
SELECT DISTINCT ?experiment ?description ?propertyValue ?propertyType
WHERE
{
  ?experiment a atlasterms:Experiment.
  ?experiment dcterms:description ?description.
  ?experiment atlasterms:hasAssay ?assay.

  ?assay atlasterms:hasSample ?sample.

  ?sample atlasterms:hasSampleCharacteristic ?c.

  ?c atlasterms:propertyType ?propertyType ;
     atlasterms:propertyValue ?propertyValue.

  FILTER ( regex (?propertyValue, "diabetes", "i") ).
}
ORDER BY ?experiment
http://goo.gl/dXgsjY
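The queries in this deck are shown without PREFIX declarations. As a minimal sketch (not part of the original slides), the Python below prepends the prefixes the diabetes query needs and builds a SPARQL-protocol GET URL for the Atlas endpoint named later in the deck. The atlasterms namespace URI and the helper names with_prefixes and sparql_get_url are assumptions made for illustration; check the namespace against the endpoint's own prefix list.

```python
from urllib.parse import urlencode

# Endpoint URL as given in this deck.
ATLAS_ENDPOINT = "http://www.ebi.ac.uk/rdf/services/atlas/sparql"

# dcterms is the standard Dublin Core terms namespace.
# The atlasterms URI is an ASSUMPTION: verify it on the endpoint page.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "atlasterms": "http://rdf.ebi.ac.uk/terms/atlas/",
}

DIABETES_QUERY = """
SELECT DISTINCT ?experiment ?description ?propertyValue ?propertyType
WHERE {
  ?experiment a atlasterms:Experiment ;
              dcterms:description ?description ;
              atlasterms:hasAssay ?assay .
  ?assay atlasterms:hasSample ?sample .
  ?sample atlasterms:hasSampleCharacteristic ?c .
  ?c atlasterms:propertyType ?propertyType ;
     atlasterms:propertyValue ?propertyValue .
  FILTER ( regex(?propertyValue, "diabetes", "i") )
}
ORDER BY ?experiment
"""


def with_prefixes(query, prefixes=PREFIXES):
    """Prepend one PREFIX line per namespace to a SPARQL query (hypothetical helper)."""
    header = "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items())
    return header + "\n" + query.strip() + "\n"


def sparql_get_url(endpoint, query):
    """Build a GET URL that passes the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ATLAS_ENDPOINT, with_prefixes(DIABETES_QUERY))
```

The resulting URL could then be fetched with any HTTP client, asking for application/sparql-results+json in the Accept header; the sketch stops before the network call so that it runs offline.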
Which genes are expressed under the 'asthma' condition?
SELECT DISTINCT ?upDown ?dbXref ?pvalue ?propertyValue
WHERE
{
  ?expRatio a atlasterms:IncreasedDifferentialExpressionRatio;
            rdfs:label ?upDown;
            atlasterms:pValue ?pvalue;
            atlasterms:hasFactorValue ?factor;
            atlasterms:isMeasurementOf ?probe .

  ?factor rdf:type efo:EFO_0000270;
          atlasterms:propertyValue ?propertyValue.

  ?probe atlasterms:dbXref ?dbXref .
}
ORDER BY ?pvalue
http://goo.gl/hIXkux
Find protein targets for Gleevec (CHEMBL941)
Activities are instances ('a', or rdf:type) of cco:Activity
Explore http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL941
How do you link an activity to this molecule (click on the property headers to get the URI)?
How do you link cco:Activity to assays?
Assays have targets, via which property (again, explore from the activity, once you have created the link to the assay)?
Targets have components, and components have references
We're interested in UniProtRef only, how do you express that via a graph pattern?
How can you constrain a pattern component to be only about Homo sapiens? (http://identifiers.org/taxonomy/9606)
SELECT DISTINCT ?dbXref
WHERE {
  ?act a cco:Activity;
       cco:hasMolecule chembl_molecule:CHEMBL941;
       cco:hasAssay ?assay .

  ?assay cco:hasTarget ?target .

  ?target cco:hasTargetComponent ?targetcmpt .

  ?targetcmpt cco:targetCmptXref ?dbXref;
              cco:taxonomy <http://identifiers.org/taxonomy/9606> .

  ?dbXref a cco:UniprotRef
}
http://goo.gl/KaC4Ie
Federated queries across EBI datasets
Find the protein targets for gleevec and show where/how they are differentially expressed.
Start from the ChEMBL endpoint
Go back to previous examples, start with the protein targets query
Use the SERVICE keyword (http://www.w3.org/2009/sparql/docs/fed/service) with the Atlas endpoint:
<http://www.ebi.ac.uk/rdf/services/atlas/sparql>
Repeat the pattern from the previous example about the Atlas, put it inside the SERVICE block, link ChEMBL/Atlas via dbXref
Remove any EFO restriction about asthma
SELECT DISTINCT ?dbXref ?upDown ?propertyValue ?pvalue
WHERE
{
  ?act a cco:Activity;
       cco:hasMolecule <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL941>;
       cco:hasAssay ?assay .

  ?assay cco:hasTarget ?target .

  ?target cco:hasTargetComponent ?targetcmpt .

  ?targetcmpt cco:targetCmptXref ?dbXref;
              cco:taxonomy <http://identifiers.org/taxonomy/9606> .

  ?dbXref a cco:UniprotRef

  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql>
  {
    ?probe atlasterms:dbXref ?dbXref.

    ?expRatio
      #a atlasterms:IncreasedDifferentialExpressionRatio;
      rdfs:label ?upDown;
      atlasterms:pValue ?pvalue;
      atlasterms:hasFactorValue ?factor;
      atlasterms:isMeasurementOf ?probe .

    ?factor atlasterms:propertyValue ?propertyValue.

    FILTER ( ?pvalue < 1E-6 ).
  }
} # ORDER BY ?pvalue
http://goo.gl/zkB2hV
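The federation steps listed for this exercise (take the ChEMBL pattern, wrap the Atlas pattern in a SERVICE block, join on ?dbXref) can be sketched as plain string composition in Python. This is an illustrative sketch only: service_block and federated_query are hypothetical helper names, and PREFIX declarations are omitted, as they are in the deck.

```python
def service_block(endpoint, pattern):
    """Wrap a graph pattern in a SPARQL 1.1 SERVICE clause (hypothetical helper)."""
    body = "\n".join("    " + line for line in pattern.strip().splitlines())
    return f"  SERVICE <{endpoint}> {{\n{body}\n  }}"


def federated_query(select_vars, local_pattern, endpoint, remote_pattern):
    """Combine a local pattern with a remote one; shared variables join the two."""
    local = "\n".join("  " + line for line in local_pattern.strip().splitlines())
    head = "SELECT DISTINCT " + " ".join(select_vars)
    return f"{head}\nWHERE {{\n{local}\n{service_block(endpoint, remote_pattern)}\n}}"


# Local part, evaluated by the ChEMBL endpoint the query is sent to.
CHEMBL_PATTERN = """
?act a cco:Activity ;
     cco:hasMolecule <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL941> ;
     cco:hasAssay ?assay .
?assay cco:hasTarget ?target .
?target cco:hasTargetComponent ?targetcmpt .
?targetcmpt cco:targetCmptXref ?dbXref ;
            cco:taxonomy <http://identifiers.org/taxonomy/9606> .
?dbXref a cco:UniprotRef .
"""

# Remote part, delegated to the Atlas endpoint; ?dbXref is the join variable.
ATLAS_PATTERN = """
?probe atlasterms:dbXref ?dbXref .
?expRatio atlasterms:isMeasurementOf ?probe ;
          rdfs:label ?upDown ;
          atlasterms:pValue ?pvalue ;
          atlasterms:hasFactorValue ?factor .
?factor atlasterms:propertyValue ?propertyValue .
FILTER ( ?pvalue < 1E-6 )
"""

QUERY = federated_query(
    ["?dbXref", "?upDown", "?propertyValue", "?pvalue"],
    CHEMBL_PATTERN,
    "http://www.ebi.ac.uk/rdf/services/atlas/sparql",
    ATLAS_PATTERN,
)
```

Printing QUERY shows the same shape as the query on the slide: the only variable that appears both outside and inside the SERVICE block is ?dbXref, which is what links ChEMBL targets to Atlas probes.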
For the proteins associated with the pathways, list their Gene Ontology associations
Selecting proteins & GO labels:
http://goo.gl/PBGokZ
Find pathways and proteins involved in pathway components
SELECT DISTINCT ?pathway ?pathwayName ?dbXref
WHERE
{
  ?pathway rdf:type biopax3:Pathway;
           biopax3:displayName ?pathwayName;
           biopax3:pathwayComponent ?reaction.

  ?reaction rdf:type biopax3:BiochemicalReaction.

  {
    { ?reaction ?reactRel ?protein }
    UNION
    {
      ?reaction ?reactRel ?complex.
      ?complex rdf:type biopax3:Complex.
      ?complex ?cpxRel ?protein.
    }
  }

  ?protein rdf:type biopax3:Protein;
           biopax3:entityReference ?dbXref
}
http://goo.gl/SxkPyH
ie, use SERVICE <http://sparql.uniprot.org/sparql>
SELECT DISTINCT ?pathway ?pathwayName ?dbXref ?goLabel
WHERE {
  ?pathway rdf:type biopax3:Pathway;
           biopax3:displayName ?pathwayName;
           biopax3:pathwayComponent ?reaction.

  ?reaction rdf:type biopax3:BiochemicalReaction.

  {
    { ?reaction ?reactRel ?protein }
    UNION
    {
      ?reaction ?reactRel ?complex.
      ?complex rdf:type biopax3:Complex.
      ?complex ?cpxRel ?protein.
    }
  }

  ?protein rdf:type biopax3:Protein;
           biopax3:entityReference ?dbXref

  SERVICE <http://sparql.uniprot.org/sparql>
  {
    ?dbXref up:classifiedWith ?go.
    ?go rdfs:label ?goLabel.
  }
}
http://goo.gl/juvawt
Where are we? Some experience from the field
Linking/Navigability
Modelling/Ontologies
Reuse?
Homogeneous approach?
Mappings?
Correctness? (formal, conceptual)
Applications
URI reuse?
URI resolution?
Content negotiation?
Ontology instantiation?
Resource linking?
Focused on biomed questions?
Potentials?
Source: http://tinyurl.com/qarj84w
Source: http://goo.gl/oWI0Yd
Source: http://tinyurl.com/ooudnhr
Geek or end user oriented?
EFO, NCBITax, UBERON, UO, OMIM
sio:assay, obi:assay
obi:specimen
sio:specimen
Similar approach: linkedISA, McCusker et al (http://goo.gl/dsSZkS),
HSCI (http://goo.gl/g2qZjK), OMIABIS
BioRDF, Deus et al, http://goo.gl/yY2jqA
MAASTRO Clinic Data, http://tinyurl.com/od6yey3
Uniform URI Policy
Good URI linkage within our organisation
eg, uniProt's proteins
Fair linkage to external resources
especially via identifiers.org (see also myEquivalents)
Fair internal and external usage of ontologies (see previous slide), often via EFO
Not much use of 'commercial' ontologies (eg, schema.org, FOAF)
Research on enriching the 'typed link' model
linksets (VoID for small data set partitions: https://goo.gl/uP9oty)
Uniform approach to URI resolution, LODEStar providing:
SPARQL DESCRIBE, eg,
http://rdf.ebi.ac.uk/resource/atlas/E-GEOD-20266
HTTP Content Negotiation, eg,
curl -L -H "Accept: application/rdf+xml" 'http://rdf.ebi.ac.uk/resource/atlas/E-GEOD-20266'
Minor issues
Negotiation doesn't work in identifiers.org, eg,
curl -L -H "Accept: application/rdf+xml" 'http://identifiers.org/ncbigene/10723071'
http://identifiers.org/ncbigene/10723071.rdf is available instead, but that's not standard
Cases like http://beta.sparql.uniprot.org/uniprot/#_5030353036370053 yield 404
they're blank nodes that actually have statements attached, not sure this is good
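The curl calls above can be mirrored in Python's standard library. The sketch below only prepares the requests, so it runs offline; rdf_request and identifiers_org_rdf_url are hypothetical helper names, and the '.rdf' suffix is the non-standard identifiers.org workaround mentioned on this slide, not a general rule.

```python
from urllib.request import Request


def rdf_request(uri, mime="application/rdf+xml"):
    """Prepare (but do not send) a GET request asking for an RDF serialisation."""
    return Request(uri, headers={"Accept": mime})


def identifiers_org_rdf_url(uri):
    """Fallback noted on the slide: append '.rdf' to an identifiers.org URI."""
    if not uri.startswith("http://identifiers.org/"):
        raise ValueError("not an identifiers.org URI: " + uri)
    return uri + ".rdf"


# Same resource as in the curl example above.
req = rdf_request("http://rdf.ebi.ac.uk/resource/atlas/E-GEOD-20266")

# Same identifier as in the identifiers.org example above.
fallback = identifiers_org_rdf_url("http://identifiers.org/ncbigene/10723071")
```

Passing req to urllib.request.urlopen would send it; urlopen follows HTTP redirects by default, which roughly corresponds to curl's -L flag.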
What are we missing?

OWL? Reasoning?
Still seems hard in the wild (performance, understandability); not much has been done so far (eg, approximate reasoning)
Performance issues are becoming even more serious with big data and data streams
maybe it's still interesting for apps like Q&A, user interaction (eg, SADI, http://goo.gl/MUj1c8)
Public SPARQL endpoints?
The Enduring Myth of the SPARQL Endpoint, https://goo.gl/I5mrJH
Restricted access to Linked Open Data and SPARQL
eg, SAFE (http://tinyurl.com/pyfkksh, http://tinyurl.com/owv7fnq)
JSON-LD and Why I Hate the Semantic Web http://goo.gl/M0atTL
EBI-RDF
Biomedical investigations (involving BioSD in particular)
Non-geeky user interfaces/applications
eg, search by feature similarity in BioSD
More datasets to RDFize/interlink
BioSD eg, ENA, COSMIC
Life Science
More standardisation, integration, best practices
dataset schemas vs biomedical ontologies (https://goo.gl/g5sDZd)
More biologist-oriented interfaces/applications
Links to elsewhere (improve outreach)
other data (eg, DBpedia, BBC), which would boost outreach
other data in other fields (eg, ecology, agrifood)
OBI, SIO, PROV, IAO
FOAF, schema.org, BIBO
EFO, NCBITax, UBERON, UO, OMIM
Infrastructure, Outreach, Applications
background image source: http://www.ikt-online.org/blog/crowded-underground-soil-and-in-fill-material-requirements
LODEStar, used for (almost) all data sets (Provides SPARQL+UI+Navigation)
Biohackathon 2014: http://tinyurl.com/q7oadyd
AtlasRDF, an R/Bioconductor client library, with bio-oriented functions, http://bioconductor.org/packages/2.14/bioc/html/AtlasRDF.html
The Biologist
BioSD
Graphical Query builders:
http://sparqlbuilder.org/ (UniProt)
SPARQLGraph (http://tinyurl.com/kzf4ttn, http://sparqlgraph.i-med.ac.at/)
OpenPHACTS APIs
Self-documenting web pages
adapters for workflow tools (eg, KNIME or Taverna)
Mostly provides canned queries, no SPARQL available, debate ongoing in the SW community
Live version: http://biohackathon.org/d3sparql/
Drawn from Ruttenberg A et al, http://tinyurl.com/caj66f
OpenPHACTS (http://swww.openphacts.org/2/sci/apps.html)
Future
EBI
Biomed World
The Geek
Practice proposals
1) Brainstorming on building applications with EBI/RDF (and other datasets)
Talk to your classmates, propose your biomedical use case, look at our/other datasets, scope, classes, etc
2) A different way to use linked data: RelFinder
Start exploring it: http://www.visualdataweb.org/relfinder/relfinder.php
Configure it (spanner symbol) with http://www.ebi.ac.uk/rdf/services/chembl as endpoint URI, leave the defaults as they are
Try it with Razaxaban and Idraparinux sodium
Use the URIs if string-search doesn't work:
http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL206335
http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1908371
Explore results, use external resources such as the EBI platform or PubMed
2.1) Another example based on networks
Assessing Drug Target Association Using Semantic Linked Data, http://tinyurl.com/pmzzdxp
Literature on Linked Data best practices
Cool URIs
http://www.w3.org/TR/cooluris/
Emerging practices for mapping and linking life sciences data using RDF, A case series
http://www.sciencedirect.com/science/article/pii/S1570826812000376
http://www.w3.org/2001/sw/hcls/notes/hcls-rdf-guide/
Exercises about GXA
Find the expression for TNNI3, having Ensembl gene id = ENSG00000129991
Hint:
Ensembl genes are encoded as:
http://identifiers.org/ensembl/ENSGXXX
use: ?probe atlasterms:dbXref <gene URI>
Solution at: http://goo.gl/2ubGx2
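The hint boils down to turning an Ensembl identifier into an identifiers.org URI and using it as the object of atlasterms:dbXref. A small Python sketch of that step follows; ensembl_uri and probe_pattern are hypothetical helper names, and the regular expression is only a loose sanity check, not an official Ensembl identifier grammar.

```python
import re

# URI pattern as given in the hint above.
IDENTIFIERS_ENSEMBL = "http://identifiers.org/ensembl/"


def ensembl_uri(ensembl_id):
    """Map an Ensembl identifier such as ENSG00000129991 to its identifiers.org URI."""
    # Loose check (an assumption): 'ENS', some capital letters, then 11 digits.
    if not re.fullmatch(r"ENS[A-Z]+\d{11}", ensembl_id):
        raise ValueError("does not look like an Ensembl id: " + ensembl_id)
    return IDENTIFIERS_ENSEMBL + ensembl_id


def probe_pattern(ensembl_id, var="?probe"):
    """Triple pattern that selects the Atlas probes cross-referenced to the id."""
    return f"{var} atlasterms:dbXref <{ensembl_uri(ensembl_id)}> ."


pattern = probe_pattern("ENSG00000129991")
```

The resulting pattern can replace the EFO restriction in the differential expression query shown earlier, leaving the rest of that query unchanged.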
Go back to the diabetes query and find the experiments via the ontology term efo:EFO_0000400 (root for various diabetes types)
Hint:
remove FILTER and use ?propertyClass rdfs:subClassOf* efo:EFO_0000400
'*' is a property path operator (http://www.w3.org/TR/sparql11-property-paths/), meaning zero or more steps
a practical/efficient way to perform simple inference
Solution at: http://goo.gl/1xsIeJ
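To see what rdfs:subClassOf* matches, it can help to compute the same thing by hand on a toy hierarchy: every class reachable from the root by following subClassOf links zero or more times, the root included. The Python sketch below does that in memory; the ex: class names are made up for illustration and are not real EFO terms.

```python
def subclass_closure(root, sub_class_of):
    """Return root plus all its direct and indirect subclasses.

    sub_class_of is an iterable of (child, parent) pairs, ie, the
    rdfs:subClassOf triples. Including root itself mirrors the
    'zero or more' meaning of the '*' path operator.
    """
    children = {}
    for child, parent in sub_class_of:
        children.setdefault(parent, set()).add(child)

    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:  # also guards against cycles
                seen.add(child)
                stack.append(child)
    return seen


# Toy hierarchy: only efo:EFO_0000400 comes from the slides, the rest is invented.
EDGES = [
    ("ex:DiabetesSubtypeA", "efo:EFO_0000400"),
    ("ex:DiabetesSubtypeB", "efo:EFO_0000400"),
    ("ex:SubtypeA1", "ex:DiabetesSubtypeA"),
    ("ex:Unrelated", "ex:SomethingElse"),
]

matches = subclass_closure("efo:EFO_0000400", EDGES)
```

A sample annotated with any class in matches would be returned by the property-path version of the query, whereas the regex FILTER only finds samples whose text happens to contain the word.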
More Exercises about GXA
From other trainings we have done:

https://www.dropbox.com/s/mcha5rc6ddxbi7m/Practical%20Session.docx?dl=0
http://www.ebi.ac.uk/rdf/services
Practice
Explore the platform from http://www.ebi.ac.uk/rdf/services, look at the documentation and schemas
Explore VOID links
Explore endpoints, try queries
Explore dataset schema/ontology files on your own
eg, feed http://www.essepuntato.it/lode with https://www.ebi.ac.uk/fgpt/ontologies/gxaterms.owl
What are the subclasses of DifferentialExpressionRatio?
That can be a quick alternative to Protégé
Gene Expression Atlas
ChEMBL
Reactome
The Story So Far
RDF Platform
Acknowledgements
Andy Jenkinson (Coordinator)
Simon Jupp, James Malone (RDF Platform, GE Atlas)
Jerven Bolleman (RDF Platform, UniProt)
Ugis Sarkans (Technical Team Leader Biosamples/Functional Genomics)
Linked Data with The EBI RDF Platform
Training course, SIB, Geneva, 1/12/2015
Marco Brandizi, brandizi@ebi.ac.uk, www.marcobrandizi.info
this presentation: http://tinyurl.com/ebirdfsib15

(May 2015)
http://goo.gl/u2zlaC
Open Issues and Possible Future Developments
More?
http://goo.gl/F19KXo
The LODEStar Interface: https://github.com/EBISPOT/lodestar
Note: Ordering can kill performance.
SPARQL might need optimisation tricks:
http://arxiv.org/abs/1304.0567
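One possible workaround for the ordering cost, which is not taken from the slides, is to drop ORDER BY from the query and sort the rows on the client. The Python sketch below sorts a result document in the standard SPARQL 1.1 Query Results JSON Format; the sample rows and example.org URIs are made up, and sorted_bindings is a hypothetical helper name.

```python
import json

# Made-up result document in the SPARQL 1.1 Query Results JSON Format.
SAMPLE_RESULTS = json.dumps({
    "head": {"vars": ["dbXref", "pvalue"]},
    "results": {"bindings": [
        {"dbXref": {"type": "uri", "value": "http://example.org/protein/A"},
         "pvalue": {"type": "literal", "value": "3.2E-9"}},
        {"dbXref": {"type": "uri", "value": "http://example.org/protein/B"},
         "pvalue": {"type": "literal", "value": "1.0E-12"}},
        {"dbXref": {"type": "uri", "value": "http://example.org/protein/C"},
         "pvalue": {"type": "literal", "value": "5E-7"}},
    ]},
})


def sorted_bindings(results_json, var, convert=float):
    """Sort result rows by one variable, converting its lexical value first."""
    doc = json.loads(results_json)
    return sorted(doc["results"]["bindings"],
                  key=lambda row: convert(row[var]["value"]))


rows = sorted_bindings(SAMPLE_RESULTS, "pvalue")
order = [row["dbXref"]["value"].rsplit("/", 1)[-1] for row in rows]
```

This only helps when the full result set is small enough to transfer; combined with LIMIT, a server-side ORDER BY is still needed to get the correct top rows.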
More past training links:
Amsterdam 2015:
https://www.dropbox.com/sh/l3nz7mizc36x0ek/AAA8n1a45p-1gAfEtQ5aACvQa?dl=0
SWAT4LS:
https://www.dropbox.com/sh/yaycah8bvorz6vq/AABPlI-YhVROaAjpV5G7ROsja?dl=0
https://www.dropbox.com/sh/wy7r2vns0r0v04a/AABfRJu9zzJTaGLAGBSyYQyga?dl=0
My Prezi Profile:
https://prezi.com/user/ncrczum5bq3y