BOAI-compliant Open Access & content mining

Given at the Open Knowledge Festival, Helsinki, Wed 19th Sept, 2012

Ross Mounce

on 24 February 2013

Transcript of BOAI-compliant Open Access & content mining

BOAI-compliant open access/data licensing matters: content mining
Ross Mounce, University of Bath & OKF Panton Fellow

Facts do not owe their origin to an act of authorship; they are not original, and thus are not copyrightable

What is Content Mining?
To extract, process and republish content manually or by machine
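
As a minimal illustration of extraction "by machine", the sketch below scans the plain text of a paper for figure captions that mention a phylogenetic tree. The regular expression, the keyword list and the function name `find_tree_captions` are assumptions made for this example; they are not taken from any tool named in the talk, and real caption formats vary by publisher.

```python
import re

# Assumed caption format: "Fig." or "Figure", a number, optional "." or ":",
# then the caption text up to the end of the line.
CAPTION_RE = re.compile(
    r"^(?:Fig\.|Figure)\s+(\d+)[.:]?\s+(.*)$",
    re.IGNORECASE | re.MULTILINE,
)

# Illustrative keywords only; a real project would curate this list.
TREE_KEYWORDS = ("phylogen", "cladogram", "consensus tree")


def find_tree_captions(text):
    """Return (figure_number, caption) pairs whose caption mentions a tree keyword."""
    hits = []
    for match in CAPTION_RE.finditer(text):
        number = int(match.group(1))
        caption = match.group(2).strip()
        if any(keyword in caption.lower() for keyword in TREE_KEYWORDS):
            hits.append((number, caption))
    return hits


sample_text = (
    "Introduction\n"
    "Figure 1. Map of sampling sites.\n"
    "Fig. 2: Strict consensus tree of 12 taxa.\n"
    "Figure 3. Phylogeny of Agapornis.\n"
)
tree_captions = find_tree_captions(sample_text)
```

Here `tree_captions` holds the captions of figures 2 and 3, which a later step could use to decide which figures are worth extracting data from.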

'Content' includes text, numbers, tables, images, video, audio, bibliographic data & metadata (thus we can mine and republish them)

The 'tree of life'
Cutting-edge content mining: applications

More than 100,000 individual studies of phylogeny have been published
Most cover very small parts of the tree of life

The best collection of tree data so far only has data from <3,000 publications
(it relies on authors to deposit their data).
Moreover, even in 2010 the rate of data deposition was only ~4%
Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie E, Kumar S, Rosauer D, Vos R. Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis. BMC Research Notes (accepted)

Extracting data from figures
Download 1000s of papers
"Feed" them to mining scripts
The scripts interpret the text, and identify useful figures that contain data
Mining *text* is relatively easy
Getting data from figures is harder but doable
It helps if the figures are vector, NOT raster graphics. More on this next...

Vector graphics should be mandatory - NO rasters
It can be relatively simple to re-extract information from vector graphics. But it seems that the majority(?) of digitally published diagrams examined so far are rasters
http://en.wikipedia.org/wiki/File:Agapornis_phylogeny.svg

We have the technology & capacity to do this
http://www.guardian.co.uk/science/2012/may/23/text-mining-research-tool-forbidden
...but it seems like we might get into legal trouble if we apply this to some subscription-access content
Peter Murray-Rust once got Cambridge access cut off after attempting to mine some literature

What are subscribers allowed to do with content?
(not much it seems)
Only CC BY literature is 'safe' to mine
"Subscriber shall not use spider or web-crawling or other software programs, routines, robots or other mechanized devices to continuously and automatically search and index any content accessed online under this Agreement"
From an Elsevier subscription agreement (2011)
http://blogs.ch.cam.ac.uk/pmr/2011/11/25/the-scandal-of-publisher-forbidden-textmining-the-vision-denied/

...and the excuses given are fanciful, e.g.
"...platforms would collapse under the technological weight of crawler-bots... [like a] denial-of-service attack"
Richard Mollet, Publishers Association
http://www.publishers.org.uk/index.php?option=com_content&view=article&id=1929:content-mining-free-for-all-would-be-bad-for-al&catid=499:general&Itemid=1608

Thus science needs Open Access, not just 'free access'
Many 'open access' journals are not explicitly licensed to allow re-use

I'm working with Peter Murray-Rust to extract open data from research literature
*Do* read his excellent blog: http://blogs.ch.cam.ac.uk/pmr/

Conclusions
With content mining we can salvage otherwise 'lost' data - this is immensely valuable
We can synthesise data from millions of papers to better harness ALL previous research
Without doubt content mining will be increasingly applied in research across all domains of academia (it's not just of use in biomedical research!)
Independent reviews such as the Hargreaves Report conclude that the potential benefits of mining are so great that exceptions should be made to copyright law specifically to allow mining.
Explicitly 'mining-friendly' licenses such as CC BY must be used to publish all future research - so one must be careful to define Open Access (BOAI/BBB).

The Budapest Open Access Initiative (BOAI)
By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
http://www.soros.org/openaccess/read
Feb 14th, 2002

@RMounce
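
A footnote on the earlier claim that re-extracting information from vector graphics can be relatively simple: the sketch below pulls text labels and straight line segments out of an SVG using only Python's standard library. It is a rough illustration, not the method used in the work described here. The function name `extract_svg_content` and the tiny sample drawing are invented for the example, and it assumes the figure uses plain `<text>` and `<line>` elements.

```python
import xml.etree.ElementTree as ET

# SVG elements live in this XML namespace.
SVG_NS = "{http://www.w3.org/2000/svg}"


def extract_svg_content(svg_source):
    """Return (labels, segments) found in an SVG document given as a string."""
    root = ET.fromstring(svg_source)

    labels = []
    for node in root.iter(SVG_NS + "text"):
        # itertext() also collects text nested inside <tspan> children.
        label = "".join(node.itertext()).strip()
        if label:
            labels.append(label)

    segments = []
    for node in root.iter(SVG_NS + "line"):
        # Missing coordinate attributes are treated as 0 in this sketch.
        coords = tuple(float(node.get(key, "0")) for key in ("x1", "y1", "x2", "y2"))
        segments.append(coords)

    return labels, segments


# Hypothetical two-branch drawing, standing in for a published tree figure.
sample_svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <line x1="10" y1="20" x2="90" y2="20"/>
  <line x1="10" y1="60" x2="90" y2="60"/>
  <text x="95" y="24">Taxon A</text>
  <text x="95" y="64"><tspan>Taxon B</tspan></text>
</svg>"""

labels, segments = extract_svg_content(sample_svg)
```

Published figures often draw branches with `<path>` elements rather than `<line>`, so a practical extractor would also need to parse path data. A raster image offers no such structure at all, which is why vector figures are so much easier to mine.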
