
NESCent seminar on Open Content Mining for phyloinformatic data

Talk given at 1pm (EST), Thursday 18th October at NESCent, Durham, NC
by Ross Mounce on 14 February 2013


Transcript of NESCent seminar on Open Content Mining for phyloinformatic data

Now the exciting part... These dictionary- and rule-based matching approaches will help us automate the identification, sorting the wheat from the chaff (figures likely to contain a phylogeny versus figures not likely to contain one)

e.g.
Fig. 1 Representative taxa of the Antennariidae: (A) Fowlerichthys scriptissimus (photo by Y. Susaki); (B) Antennarius maculatus (photo by D. Harasti); (C) Histrio histrio (photo by D. Cook); (D) Antennatus tuberosus (photo by D. Harasti); (E) Nudiantennarius subteres (after Smith and Radcliffe, in Radcliffe, 1912);

Fig. 4. Fifty percentage majority rule phylogeny of the Antennariinae, from trees sampled in the posterior, generated from Bayesian analyses with the saturated data (third codon position of COI). Branch lengths are measured in expected substitutions per site and are proportional to length. Numbers above nodes are posterior probabilities and numbers below nodes are bootstrap proportions from 500 pseudoreplicates used from maximum likelihood analysis

Highlight stuff!
ChemicalTagger
PhyloTagger
for finding data & interpreting (into semantic data) figure captions of phylogenetic figures

work only just started.
Build a dictionary of all keywords & variants:
COI, cytochrome oxidase I, mtDNA, majority-rule, ML, Bremer support, posterior probability
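
As a rough illustration of the dictionary idea, here is a minimal Python sketch. It is not PhyloTagger or ChemicalTagger code: the names `PHYLO_DICTIONARY`, `matched_terms` and `likely_phylogeny`, the regular expressions, and the two-term threshold are all assumptions made for this example. The "bootstrap" and "phylogeny" entries are borrowed from the Fig. 4 caption rather than from the keyword list above.

```python
import re

I = re.IGNORECASE

# Hypothetical dictionary: canonical term -> regex covering its variants.
# Abbreviations (COI, ML, mtDNA) stay case-sensitive to avoid false hits.
PHYLO_DICTIONARY = {
    "COI": re.compile(r"\bCOI\b|\bcytochrome (?:c )?oxidase (?:subunit )?I\b"),
    "mtDNA": re.compile(r"\bmtDNA\b|\bmitochondrial DNA\b"),
    "ML": re.compile(r"\bML\b|\b[Mm]aximum[- ]likelihood\b"),
    "majority-rule": re.compile(r"\bmajority[- ]rule\b", I),
    "Bremer support": re.compile(r"\bBremer support\b", I),
    "posterior probability": re.compile(r"\bposterior probabilit(?:y|ies)\b", I),
    "bootstrap": re.compile(r"\bbootstrap", I),
    "phylogeny": re.compile(r"\bphylogen(?:y|ies|etic)\b", I),
}


def matched_terms(caption):
    """Return the canonical terms whose variants occur in the caption."""
    return sorted(
        term for term, pattern in PHYLO_DICTIONARY.items()
        if pattern.search(caption)
    )


def likely_phylogeny(caption, min_terms=2):
    """Crude rule: a caption hitting at least `min_terms` dictionary
    entries is flagged as probably describing a phylogeny figure."""
    return len(matched_terms(caption)) >= min_terms


if __name__ == "__main__":
    fig1 = ("Fig. 1 Representative taxa of the Antennariidae: "
            "(A) Fowlerichthys scriptissimus (photo by Y. Susaki)")
    fig4 = ("Fig. 4. Fifty percentage majority rule phylogeny of the "
            "Antennariinae, generated from Bayesian analyses with the "
            "saturated data (third codon position of COI).")
    for caption in (fig1, fig4):
        print(likely_phylogeny(caption), matched_terms(caption))
```

Keeping one canonical key per concept means variants such as "cytochrome oxidase I" and "COI" are counted once, so a caption cannot pass the threshold just by repeating the same idea in different words.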
Re-using PDFs is like trying to turn hamburgers back into a cow (attributed to PMR but not actually his phrase)
<insert funny image here>


REF
Some Technical Details
Want to join in? / Adapt to other data?
The real brains behind this proposal
I'm just providing a novel biological use-case for pre-existing tools and content mining workflows

Computational chemist / chemoinformaticist
Instrumental in semantic chemistry, 'Blue Obelisk movement' recently recognised with the Skolnik award.
Peter Murray-Rust
Started in 2004 in Cambridge (UK)
Bringing together the Open community:

Open government data, Open economics, Open corporates, Data-Driven Journalism, Open Science, Open city data
Open GLAM (Galleries, Libraries, Archives, Museums)...

World-wide: chapters in Finland, Brazil, Switzerland, growing the network all the time
The Open Knowledge Foundation

The Panton Fellowships are funded by Open Society Foundations (formerly Open Society Institute), and administered by the Open Knowledge Foundation
The Panton Fellowships

The Panton Arms is a pub in Cambridge (UK)
It's where the Panton Principles for Open Data in Science were written by PMR, Cameron Neylon, Rufus Pollock and John Wilbanks
The Panton Fellowships

Something I've learned:
change in academia is glacially slow

“Science progresses one funeral at a time”
(attributed to Max Planck)

A sketch of the plan

What Peter Murray-Rust and I want to do is hack the tree out of its PDF container, rather than just treating it as an image.

Our method is significantly different
TreeSnatcher Plus (Laubach et al 2012)

manual

But not so picky with the type of trees
Potential to improve (?) e.g. get tip labels
Image-based techniques

TreeThief (Andrew Rambaut, 2000)
- kudos for being the 1st AFAIK

TreeRipper (Joseph Hughes, 2011)
- many significant improvements, not least of which introducing automation

~33% success rate on trees it can get
Image-based techniques

Manually search & scrape supp. info & lab websites

Doesn't scale.
Rare to find any 'extra' such data here anyhow

How many studies are archived in TreeBASE?
based on ISI Web of Science analyses (not even all the literature)
>66,000 21st century (2000-2012) phylogeny papers

This data is extremely re-usable, re-mixable and valuable; hundreds if not thousands of different hypotheses can be tested using trees:

levels of homoplasy, tree balance, species delimitation, stratigraphic congruence, host/parasite coevolution, diversification rates, the effect of extinction events,
dating nodes, ancestral state reconstruction...

Phylogenetic data is hugely valuable:
economically as well as scientifically
Phylogenetic inferences are time consuming and costly to create.

CPU time survey of 108 authors of recently published phylogeny papers (2012) in MPE, P1, PNAS, Science, SystBiol, BMC EvoBiol asking them basically:
“How long did the computation take (in CPU hours)?”

Why am I even doing this?

I'm content mining rather than 'textmining' because it's about more than just text

A significant part of my BBSRC grant proposal is about figure mining (non-textual content!)

What is content mining?

Ross Mounce

Open Knowledge Foundation Panton Fellow
&
University of Bath PhD Candidate

Mining phyloinformatic data
from the published literature

Raw data: https://docs.google.com/spreadsheet/ccc?key=0AtbO6mZEvieCdC1JN1YySVhyMUFLRURyc2VaSUdjblE
median: >1000 CPU hours (30% response rate, not all responses were helpful)

You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, tweet, or post video of;
this presentation.
Provided that:
You attribute the work to its author and respect the rights and licenses associated with its components.
[adapted from http://www.slideshare.net/CameronNeylon/permissions]

Content can include video, audio, metadata, text and images - anything we can mine.

We NEED to keep phylogenetic data

So how many phylogeny papers are there?
I gave my initial guesstimate at the Young Systematists' Forum, 1st Dec 2010
[ my first big Prezi too: http://www.bit.ly/phylodata ]
40,000 phylogeny papers
(As it turns out... I was wrong!)

So how many phylogeny papers are there (II)?
Stoltzfus A., O’Meara B., Whitacre J., Mounce R., Gillespie E., Kumar S., Rosauer D. & Vos R. Sharing and Re-use of Phylogenetic Trees (and associated data) to Facilitate Synthesis. BMC Research Notes 2012. in press, should be out any day now... with supporting in

Emailing the corresponding author?

How do we get this data back?

We show <4% of publication year:2010 phylogenetic analyses have data in TreeBASE
...and there's a frightening volume of Neighbor-Joining (NJ) analyses out there!

We can fix the future:
+ Research Funder Data Management Mandates & http://thedatahub.org/

But how can we get back the data from the past?

More rigorous analyses such as Wicherts et al (2006) show 25.7% 'success' for emailed data requests
My personal experience during my PhD... 30% - 50% 'success'
...good luck with that!
It's a 20th century solution to a 21st century problem
http://dx.doi.org/10.1037/0003-066X.61.7.726

How do we get this data back?

Re-interpret the image of the phylogeny?

Many have made valiant efforts in this respect
Genuinely useful for getting small numbers of trees
...of certain types, conditions & caveats aplenty

But a bit too picky with the type of trees it can do
Potential to improve (?) and to be incorporated into other workflows...
N.B. This is one of the first papers I had the pleasure of reviewing.

Vector graphics retain a LOT of info
It can be relatively simple to re-extract high precision information from vector graphics.
But it's an open question as to how many published phylogeny figures are vector & how many are raster
http://en.wikipedia.org/wiki/File:Agapornis_phylogeny.svg

Peter has significant & extensive experience in this area, working on extracting data from the chemical literature

Download 1000s of papers
"feed" them to mining scripts
the scripts interpret the text, and identify useful figures that contain data
mining *text* is relatively easy
getting data from *figures* is harder but doable

New best practice: vector graphics!
Need to create a MIAPA metadata annotation ontology

Have already begun with OA BMC corpus creation and first pass exploration & annotation
https://github.com/rossmounce/BMCphyloannotation
(We also have anything ever published by BMC containing 'phylogen*' >8000 papers)

Getting PDFs & bibliographic metadata is easy and can be done for any structured publisher website with PubCrawler
Info: http://openbiblio.net/2012/06/13/pubcrawler-finding-research-publications/
Code: https://bitbucket.org/petermr/pub-crawler

Some Technical Details (II)
ChemicalTagger
for finding data & interpreting (into semantic data)
figure captions of chemical figures

will be redeployed in this context as: 'PhyloTagger'
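
To make "interpreting captions into semantic data" concrete, here is a small Python sketch of rule-based extraction from a caption. It does not use or imitate ChemicalTagger's actual API; `interpret_caption`, the pattern tables and the output keys are hypothetical names invented for this example, and the rules only cover phrasing like that in the Fig. 4 caption quoted earlier.

```python
import re

# Hypothetical rule tables: label -> regex (assumptions, not PhyloTagger rules).
METHOD_PATTERNS = {
    "Bayesian": r"\bBayesian\b",
    "maximum likelihood": r"\b[Mm]aximum[- ]likelihood\b|\bML\b",
    "neighbor-joining": r"\b[Nn]eighbou?r[- ][Jj]oining\b|\bNJ\b",
    "parsimony": r"\b[Pp]arsimony\b",
}
MARKER_PATTERNS = {
    "COI": r"\bCOI\b|\bcytochrome (?:c )?oxidase (?:subunit )?I\b",
    "mtDNA": r"\bmtDNA\b",
}


def interpret_caption(caption):
    """Return a dict of facts found in a figure caption (empty if none)."""
    facts = {}

    # Consensus type, normalised to a hyphenated lower-case label.
    m = re.search(r"\b(majority[- ]rule|strict consensus)\b", caption, re.I)
    if m:
        facts["consensus"] = m.group(1).lower().replace(" ", "-")

    methods = [k for k, p in METHOD_PATTERNS.items() if re.search(p, caption)]
    if methods:
        facts["methods"] = methods

    markers = [k for k, p in MARKER_PATTERNS.items() if re.search(p, caption)]
    if markers:
        facts["markers"] = markers

    # Number of bootstrap pseudoreplicates, e.g. "500 pseudoreplicates".
    m = re.search(r"(\d[\d,]*)\s+(?:bootstrap\s+)?(?:pseudo)?replicates",
                  caption, re.I)
    if m:
        facts["bootstrap_replicates"] = int(m.group(1).replace(",", ""))

    # Which support value sits above or below the nodes.
    label_rule = (r"numbers (above|below) nodes are "
                  r"([a-z ]+?)(?= and | from |[.,;]|$)")
    for m in re.finditer(label_rule, caption, re.I):
        facts.setdefault("node_labels", {})[m.group(1).lower()] = m.group(2).strip()

    return facts


if __name__ == "__main__":
    fig4 = ("Fifty percentage majority rule phylogeny of the Antennariinae, "
            "generated from Bayesian analyses (third codon position of COI). "
            "Numbers above nodes are posterior probabilities and numbers "
            "below nodes are bootstrap proportions from 500 pseudoreplicates "
            "used from maximum likelihood analysis")
    for key, value in interpret_caption(fig4).items():
        print(key, "->", value)
```

Hand-written rules like these are brittle by design; the point is only to show the kind of structured record a caption could be turned into before any tree is extracted from the figure itself.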

Demo of CT here:
https://bitbucket.org/petermr/chemicaltagger
Hawizy et al 2011 http://www.jcheminf.com/content/3/1/17
http://chemicaltagger.ch.cam.ac.uk/index.html
High precision
High recall

The Panton Fellowships are there to fund UK-based early-career scientists to work on encouraging an Open Data ethos. The first two went to Sophie Kershaw (Oxford) & me (Bath).
http://pantonprinciples.org/panton-fellowships/

Science is based on building on, reusing and openly criticising the published body of scientific knowledge.
For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.
http://pantonprinciples.org/

http://okfn.org/chapters/
The Open Knowledge Foundation
Many active mailing lists & skype meetings

Open Science: http://lists.okfn.org/mailman/listinfo/open-science
Open Access: http://lists.okfn.org/mailman/listinfo/open-access
Open Bibliography: http://lists.okfn.org/mailman/listinfo/open-bibliography

+ the annual conference; OKFestival http://okfestival.org/

http://blogs.ch.cam.ac.uk/pmr/

Some Technical Details (III)
The content mining engine:
AMI2
(Amanuensis 2)

a Java/Maven project using the Apache PDFBox libraries
https://bitbucket.org/petermr/ami2

#PDFhacking @RMounce #OpenContentMining

http://www.biomedcentral.com/1471-2105/12/178
http://www.biomedcentral.com/1471-2105/13/110
http://en.wikipedia.org/wiki/Peter_Murray-Rust

Thanks
Todd Vision for inviting me here to speak
Karen, Hilmar and more for helping to arrange my visit
Graham Slater for inviting me to speak at SVP tomorrow(!)
Rod Page for getting me on Twitter (life-changing, srsly!)
Peter Murray-Rust (Cambridge) & Matthew Wills (Bath) for being brilliant mentors

OKF, OSF, BBSRC & The Willi Hennig Society for $$$

...and thank YOU for watching