Presentation Overview,
6 December 2011, Australian Computer Society, Canberra
Scaling Up: The Technology behind the National Library of Australia’s
Newspaper Digitisation and Trove Search Service.
Disclaimer: The views and opinions in this talk are my own. I am not
speaking as a representative of the National Library of Australia. I
currently work at the NLA as a part-time contract programmer. I am
not part of the management structure, I do not work in the "business
area" of the library and I do not make policy decisions.
Themes of this presentation:
The Australian Newspapers Digitisation Program is a collaboration
between the state and territory libraries led by the NLA. The
Australian Newspapers Digitisation Program began in March 2007 and
the Australian Newspapers web site was released as a beta version in
August 2008 as a free online service which enabled full text searching
of newspaper articles as well as tagging, comments and correction of
OCR'd text (http://trove.nla.gov.au/ndp/del/home). Because the OCR
quality varies greatly, correcting the text substantially improves search
recall.
Rose Holley has written extensively on the Newspapers digitisation
program (see http://www.google.com.au/search?q=rose+holley+newspapers
and links below).
Providing the capability to correct the OCR text has resonated with
the public. Over 52 million lines have been corrected since July
2008. The public have resources, interests and expertise that dwarf
those of the NLA, and ceding control of the quality of the OCR text to
the public has been very rewarding. The initial fears that either the
public would be uninterested or that the correction, tagging and
commenting facilities would only attract vandals and spammers were
unfounded.
After the initial implementation of the Newspapers delivery service,
and encouraged by the success of the new infrastructure, the NLA started
implementing the Trove single search system on the same infrastructure
(http://trove.nla.gov.au/). The standalone Newspapers
service was "folded in" to Trove late in 2010. Although Trove links
to millions of freely available and licensed digital resources, the
Newspapers content dominates activity in Trove (with over 80% of
searches and page views). Trove currently receives 4.7M HTTP "hits"
per day, generates 60GB of outbound responses and serves about 400,000
page views to about 40,000 "human" visitors (with a similar page volume
going to robots/web-crawlers).
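As a rough cross-check of the traffic figures above, the daily totals can be turned into averages. This is only a sketch: the constants are the figures quoted in this overview, the function name is made up for illustration, and "60GB" is assumed to mean decimal gigabytes.

```python
# Figures quoted in this overview (per day).
HITS_PER_DAY = 4_700_000
OUTBOUND_BYTES_PER_DAY = 60 * 10**9  # ASSUMPTION: 60GB means decimal gigabytes
PAGE_VIEWS_PER_DAY = 400_000
VISITORS_PER_DAY = 40_000
SECONDS_PER_DAY = 24 * 60 * 60


def daily_averages():
    """Return simple averages derived from the daily totals (hypothetical helper)."""
    return {
        "hits_per_sec": HITS_PER_DAY / SECONDS_PER_DAY,
        "kb_per_hit": OUTBOUND_BYTES_PER_DAY / HITS_PER_DAY / 1000,
        "pages_per_visitor": PAGE_VIEWS_PER_DAY / VISITORS_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in daily_averages().items():
        print(f"{name}: {value:.1f}")
```

The hits-per-second average comes out a little over 54, in line with the "~55 per sec averaged over the day" shown on the traffic slide. Peak rates are of course much higher than the daily average.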
Trove uses the open-source Lucene full text search library, with some
indices using the SOLR front-end. It uses MySQL as a relational
database, and the system is written in Java using both JSP and
Restlets with Freemarker for web page generation.
Trove manages over 250 million records, about 1.1TB of Lucene indices
and 1.2TB of MySQL databases. Newspaper image derivatives of 6
million pages use 70TB of storage, whilst the "master" images of those
pages occupy 250TB.
The Trove search service is implemented as a "master" server which
performs updates and replicates copies of indices to (currently) 4
slave servers. Each slave is a "commodity" server with 64GB of
memory, between 8 and 12 Intel Xeon cores, and between 3 and 5
"commodity" Solid State Disks (SSDs).
Because of the index size, the complexity of relevance-ranked queries
and the volume of queries (peaking at over 80 per second), normal
"spinning disk", regardless of configuration, could not economically
meet the IO requirements of Trove. However, SSDs can quickly perform
many thousands of random 4K reads per second, and it is SSDs which
make feasible the approach we've taken with Trove.
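A back-of-envelope sketch of the I/O argument above. The 5,000 reads per second figure is the busy-period rate for a single slave drive reported later in this presentation; the 150 random reads per second for one spinning disk is an assumed ballpark, not a Trove measurement, and the function name is hypothetical.

```python
import math

# Busy-period read rate for a single slave SSD, as reported on the slides.
SSD_READS_PER_SEC = 5_000

# ASSUMPTION: ballpark random-read rate of one seek-bound spinning disk.
ASSUMED_DISK_READS_PER_SEC = 150


def spindles_needed(required_reads_per_sec: int, reads_per_disk: int) -> int:
    """Disks needed to serve a random-read load, ignoring caching and RAID overhead."""
    return math.ceil(required_reads_per_sec / reads_per_disk)


if __name__ == "__main__":
    n = spindles_needed(SSD_READS_PER_SEC, ASSUMED_DISK_READS_PER_SEC)
    print(f"Roughly {n} spinning disks to match one SSD's busy-period read rate")
```

Under that assumption, each SSD in a slave would need to be replaced by dozens of spindles, and each slave has between 3 and 5 SSDs, which is the sense in which spinning disk could not economically meet the requirement.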
Trove handles over 250K record updates per day. Updates to
bibliographic data are processed at a rate of 100 per second; updates
to full text records (newspaper articles, full text journals) at a
rate of 50 per second.
There are many opportunities and demands for growth of the service.
There are journal article sources containing hundreds of millions of
new articles. The NLA maintains a "whole .au domain web harvest" of
approximately 100TB, which is currently a "dark archive". There is
enormous demand and potential for mass digitisation and digital
delivery of other library materials (such as books, manuscripts) and
tens of millions more newspaper pages to process.
To date, hardware capabilities have kept pace with data and search
volumes, but future growth may require different approaches. In the
short term, the just announced Lucene 3.5 and forthcoming Lucene 4
will greatly reduce memory and CPU requirements and will help us
reduce update latency.
Trove exemplifies the value of online access, the "many hands make
light work" principle, and the fact that trust is usually repaid handsomely
(http://www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf).
But Trove is just the first "baby steps" along a challenging path
that libraries, and more broadly cultural institutions, must travel to
retain relevance in shaping conversations and our society. Digital
delivery offers the prospect of not only better service but also
ultimately cost savings in providing these services. However,
national cultural institutions such as the NLA are operating in very
tight budgetary conditions. Amongst the key challenges for the NLA
and Trove are to make content acquisition cheaper and faster, to create
an infrastructure capable of dealing with the "digital deluge" and to
make the terrifying leap from the physical world of atoms (books) to
the virtual world of bits: to understand, as Nat Torkington puts it,
the implications of the library's reading room really being our
patrons' web browsers. For many, if it isn’t digitally accessible it
simply doesn’t exist.
Trove is the product of many people working together - many dedicated
staff at the NLA and partner institutions, and thousands of members of
the public. However, Trove has had two champions who have given it
life and shaped its direction: Warwick Cathro and Rose Holley. Warwick
retired earlier this year after 33 years at the NLA. As the Assistant
Director-General for Resource Sharing and Innovation, he was the
champion and driver responsible for both the newspapers and Trove
projects. He created the freedom and "space" for their development
which led to their success. Rose was recruited from the University of
Auckland to be the manager of the newspapers project, and as a natural
and inspirational librarian, innovator and tireless worker and
communicator, set the tone for the project, and subsequently for the
Trove project which she also managed. Rose announced today that she
is about to embark on a new digitisation challenge at the National
Archives of Australia.
Related Links:
Websites
Trove website: http://trove.nla.gov.au
Australian Newspapers website: http://trove.nla.gov.au/ndp/del/home
Australian Newspapers Digitisation Program Project Documentation
website: http://www.nla.gov.au/ndp/project_details/
Papers/Articles on Trove
Cathro, W. (2010). Developing Trove: The Policy and Technical
Challenges. http://www.nla.gov.au/openpublish/index.php/nlasp/article/view/1666
Holley, R. (2010). Trove: Innovation to Access to Information in
Australia. http://www.ariadne.ac.uk/issue64/holley/
Holley, R. (2010). Crowdsourcing: How and Why Should Libraries Do It?
http://www.dlib.org/dlib/march10/holley/03holley.html
Holley, R. (2011). Trove: The First Year January 2010 - January 2011.
http://www.nla.gov.au/openpublish/index.php/nlasp/article/view/1882
Holley, R. (2011). Resource Sharing in Australia: ‘Find’ and ‘Get’ in
Trove – making ‘Getting’ better.
http://www.nla.gov.au/openpublish/index.php/nlasp/article/view/1868
Holley, R. (2011). Extending the scope of Trove: Addition of
e-resources subscribed to by Australian Libraries.
http://dlib.org/dlib/november11/holley/11holley.html
Trove system architecture diagram
http://www.nla.gov.au/trove/marketing/Trove%20architecture%20diagram.pdf
Presentations
This presentation: http://prezi.com/tcutgfi1ll2k/scaling-up/
Other presentations on Trove and Newspapers:
http://www.slideshare.net/RHmarvellous/presentations
Presentation to the Australian Computer Society Canberra Branch
6 December 2011
Kent Fitch
National Library of Australia
and Project Computing Pty Ltd
Australian Newspapers
Digitisation Program
INGEST 1
SCAN
INDEX TEXT
"A question of
collaboration"
Peter Macinnis
Ockham's Razor
ABC-RN
27 Nov 2011
‘A wonderful tool - the amount of user
control is very surprising but refreshing.’
'The site you manage is a nightmare! It’s addictive. Keeps me awake at night. Congratulations!'
‘An interesting way of using interested
readers “labour”! I really like it.’
The National Library of Australia ... have a remarkable resource called Trove. This includes a massive collection of digitised newspapers, linked to computer generated text which registered users can correct. More than that, they can add generic tags like 'bushrangers', '39th Battalion' or one of my favourites 'early use of language', the tag I hang on the earliest instances I find of words like 'squatter', 'billy', 'bludger', 'swag' and 'fossick' among others. Those tags are there for all time.
Slowly the resource is growing and the prospects for future researchers are being quietly enhanced, not by hewing and slashing behemoths but by nibbling and gnawing mice. One day I hope people will turn around, raise their eyebrows and ask "Where did all that come from?"
A future built on collaboration relies on people who gain a quiet joy from contributing gems, nuggets and crumbs to future generations, whimsical folk who amuse themselves by committing acts of anonymous scholarship.
Goal
GENERATE DERIVATIVES
Improve access to
Australian newspapers
How
a national initiative to
build an online service
that provides free access
to newspapers published
between 1804 and 1954
QA 1
400dpi
greyscale +
bitonal
‘OCR text correction is great! I think
I just found my new hobby!’
‘It’s looking like it will be very cool
and the text fixing and tagging is
quite addictive.’
‘I applaud the capability for readers
to correct the text.’
assign
http://www.abc.net.au/radionational/programs/ockhamsrazor/a-question-of-collaboration/3692142
APPLY DESKEW TO GREYSCALE
QA 2
QA 3
INGEST 2
BATCH
DESKEW
ZONE
OCR
CATEGORISE
Digitising Newspaper Content
Humans...
Trove
40K visits/day
400K page views/day
240K searches/day
750K unique visitors/month
Newspapers content dominates
80% of searches
("all zone" search is a further 10%)
125K newspaper page or article views/day
Lists
Tags
Robots..
..about the same as humans
Googlebot requests
~220K pages/day
Users
Comments
Newspaper
corrections
Merges/Splits
Content
Participation
mostly anonymous, but...
46,000 logged on at least once
4,300 logged on > 20 times
200 logged on > 500 times
                             Documents  Lucene Index  MySQL
Newspapers                   59M        320GB         710GB
Journal Articles             132M       420GB
Books, photos, ...           24M        120GB
Website archive              36M        260GB
People                       1M         3GB
(all non-newspaper content)                           520GB
Total                        252M       1.1TB         1.2TB
Humans
4.7M HTTP hits/day
Humans + zombies
~55 per sec averaged over the day
900K hits of newspaper image tiles
60GB/day outbound WAN traffic
15K in total
20 users per month
500 actions per month
1.8M in total
900 users create 50K each month
33K in total
220 users create 1.5K each month
16K in total
500 users create at least 1 each month
+
52M lines in total
3200 users correct 2.4M lines per month
120K different articles per month
Newspaper image data
...but 0.6M automatically derived from Wikipedia
Of the rest:
...and growing
6 M pages
70TB derivatives
250TB masters
jpg tiles - 8 resolutions plus PDF with text
Plus unknown number anonymous users - 12% of lines
greyscale + bitonal TIFF @ 400dpi
LZW compression
greyscale:bitonal size - 30:1
Trove system architecture
Hardware
Software
service wrapper for Lucene
web service for
search/update/admin
replication
sharding
schemas
open source
very active development
full text search engine library
written in Java
very easy to use
very fast
clever "segment" design
very scalable
open source
very active development
Single update master - multiple search slaves
Commodity master server with data on SAN
Commodity slave servers with local SSD
Image derivatives on a SAN
Image masters staged (mostly on tape)
Lucene (newspapers) and
SOLR/Lucene (everything else)
MySQL
Linux
Java: Jetty/JSP (newspapers),
Jetty/Restlet/Freemarker (everything else)
haproxy
Two copies of each index are
distributed across the slaves
..so a slave can fail, and the
system still works
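The "a slave can fail" property can be checked mechanically for any given layout. The sketch below uses a hypothetical placement of index copies on four slaves (the slave names, index names and assignment are assumptions for illustration; the real Trove layout may differ) and verifies that every index is still available after any single slave is lost.

```python
# HYPOTHETICAL placement of index copies on slaves; the real layout may differ.
PLACEMENT = {
    "slave1": {"newspaper", "books"},
    "slave2": {"newspaper", "webarchive", "people"},
    "slave3": {"journalarticle", "books", "webarchive"},
    "slave4": {"newspaper", "journalarticle", "people"},
}


def copies(placement):
    """Count how many slaves hold a copy of each index."""
    counts = {}
    for indices in placement.values():
        for name in indices:
            counts[name] = counts.get(name, 0) + 1
    return counts


def survives_any_single_failure(placement):
    """True if every index is still served after any one slave is removed."""
    all_indices = set().union(*placement.values())
    for failed in placement:
        remaining = set().union(
            *(idx for slave, idx in placement.items() if slave != failed)
        )
        if remaining != all_indices:
            return False
    return True


if __name__ == "__main__":
    print(copies(PLACEMENT))
    print("Survives any single slave failure:", survives_any_single_failure(PLACEMENT))
```

With at least two copies of each index on different slaves, the check passes; a layout with any single-copy index fails it.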
MySQL runs on a separate server
The Trove "UI" JVM is managed
like an index slave
SAN
Newspaper Index
JournalArticle index
Books etc index
Webarchive index
People index
Update Master
Load balancers direct requests to slaves
Slave 1
Slave 2
Slave 4
Slave 3
Newspaper Index
Books etc index
Newspaper Index
Webarchive index
People index
Newspaper Index
JournalArticle index
People index
JournalArticle index
Books etc index
Webarchive index
Slave hardware
huge indices, >> memory
memory cache doesn't help much
I/O madness
Each incoming "user" search
generates 9 Lucene index searches
... that's between 80 and 100 searches/sec
during busy hours
Even the simplest query is expanded to a faceting, complex, boolean mess to improve relevance ranking
Each search is "expensive"
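To give a feel for how a simple query becomes a large boolean expression, here is a minimal sketch that builds the optional boosted clauses as a query string. The field names, slop values and boosts are copied from the "pattersons curse" example in this presentation; the function and table names are hypothetical, the sketch only produces text in Lucene query syntax rather than calling Lucene, and it does not escape special characters or add the faceting that the real system performs.

```python
# Per-field boosts for individual terms (values taken from the slide example).
TERM_BOOSTS = [
    ("text", 0.5), ("title", 4), ("creator", 3), ("subject", 3),
    ("s_subject", 1), ("s_title", 1.3), ("s_creator", 1),
]

# (field, slop, boost) for whole-phrase matches; slop None means an exact phrase.
PHRASE_BOOSTS = [
    ("s_text", 2000, 0.5), ("text", 2000, 3), ("title", 40, 12),
    ("creator", 20, 9), ("subject", 2000, 9), ("s_subject", 2000, 4),
    ("exactTitle", None, 25),
]


def expand(query: str) -> str:
    """Build the optional relevance clauses for a user query (illustrative only)."""
    terms = query.split()
    phrase = " ".join(terms)
    clauses = []
    for field, boost in TERM_BOOSTS:
        parts = [f'{field}:"{term}"^{boost}' for term in terms]
        clauses.append("(" + " OR ".join(parts) + ")")
    for field, slop, boost in PHRASE_BOOSTS:
        slop_part = f"~{slop}" if slop is not None else ""
        clauses.append(f'({field}:"{phrase}"{slop_part}^{boost})')
    return " OR\n".join(clauses)


if __name__ == "__main__":
    print(expand("pattersons curse"))
```

Even a two-word query yields fourteen clauses across seven or more fields, which is part of why each search is expensive.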
Typical busy period
single slave drive:
5K reads/sec
30MB/sec
0.22ms queue+service time
1000 reads: 1 write
slave CPU:
40% user, 5% IO wait, 10 load-avg10
q= required: (pattersons curse)
optional:
(text:"pattersons"^0.5 OR text:"curse"^0.5) OR
(title:"pattersons"^4 OR title:"curse"^4) OR
(creator:"pattersons"^3 OR creator:"curse"^3) OR
(subject:"pattersons"^3 OR subject:"curse"^3) OR
(s_subject:"pattersons"^1 OR s_subject:"curse"^1) OR
(s_title:"pattersons"^1.3 OR s_title:"curse"^1.3) OR
(s_creator:"pattersons"^1 OR s_creator:"curse"^1) OR
(s_text:"pattersons curse"~2000^0.5) OR
(text:"pattersons curse"~2000^3) OR
(title:"pattersons curse"~40^12) OR
(creator:"pattersons curse"~20^9) OR
(subject:"pattersons curse"~2000^9) OR
(s_subject:"pattersons curse"~2000^4) OR
(exactTitle:"pattersons curse"^25)
subject:"pattersons curse" ~2000 ^9
field: subject
slop: ~2000
score boost: ^9
Updates per day
Journal articles        90k
Newspaper articles      80k
Libraries Australia     35k
Pandora webpages        25k
OpenLibrary             10k
Newspaper corrections   8k
Misc harvested content  7k
HathiTrust              6k
Tags/comments/merges    2k
Total                   ~260K
Relevance Scoring is based on "TF IDF"
Term Frequency * Inverse Document Frequency
Term Frequency: a function of the number of times the term appears in the document
Inverse Document Frequency: an (inverse) function of the number of documents in which the term appears
... modified by document length and document "boosts"
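A simplified sketch of the TF IDF idea for a single term in a single document. The square-root and logarithm shapes below follow the general form of Lucene's classic similarity, but this is an illustration only: it omits query normalisation, coordination and index-time norm encoding, and the function names are made up for the example.

```python
import math


def tf(freq: int) -> float:
    """Term frequency factor: grows with occurrences, but sub-linearly."""
    return math.sqrt(freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms: int) -> float:
    """Document length factor: a match in a shorter field counts for more."""
    return 1.0 / math.sqrt(num_terms)


def score(freq, doc_freq, num_docs, num_terms, boost=1.0):
    """Simplified single-term score: TF * IDF, modified by length and boost."""
    return tf(freq) * idf(num_docs, doc_freq) * length_norm(num_terms) * boost


if __name__ == "__main__":
    rare = score(freq=2, doc_freq=10, num_docs=1_000_000, num_terms=500)
    common = score(freq=2, doc_freq=400_000, num_docs=1_000_000, num_terms=500)
    print(f"rare term: {rare:.3f}, common term: {common:.3f}")
```

A term that appears in few documents contributes far more to the score than one that appears in many, and the boost simply scales the result, which is how the per-field boosts in the expanded query steer ranking.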
Update Processing
Newspaper derivatives
Newspaper and journal article indexing
Bibliographic data matching and indexing
Scale a bit more
Our pain points
Add more index copies (slaves)
Java heap & GC
crazy # of terms..
Faceting
Scale a lot more
100's of millions of articles?
whole domain web harvest ~100TB?
mass book digitisation?
manuscript digitisation?
ongoing newspaper & magazine digitisation?
expensive, possibly bogus,
possibly not useful
Split ("shard") index,
then replicate shards
Update Latency
newspapers UI is 4 years old..
UIs age quickly
Touch interface
Hathi mirror? (8.6m volumes, 7k tons, 400TB)
OAIster?
Australia research data?
public e-repository?
ABC archives?
electronic books?
Access is paramount
Online presence is paramount
Many hands make light work
"the public" collectively have resources, interests, expertise which dwarf our own
Quantity over quality
imperfect data still has great value
Ceding control can be very
rewarding
Lucene scales very well
Good relevance ranking is
expensive
SSD is very cost effective
jpeg2000 would have been nice
..one day
Where to from here?
What are libraries about?
Books?
Bits?
Nathan Torkington
Libraries: Where It All Went Wrong
Like Microsoft, libraries are hoping to stay relevant with "digital bolt-ons"
What are libraries about?
Universal Access
Ideas
Conversations
People
"You’ve added digital after the fact. You probably have special digital groups, probably (hopefully) made up of younger people than the usual library employee.
Congratulations, you just reproduced Microsoft’s strategy: let’s build a few digital bolt-ons for our existing products. Then let’s have some advance R&D guys working on the future while the rest of us get on with it. But think about that for a second. What are the rest of us working on, if those young kids are working on the future? Ah, it must be the past.
So what you’ve effectively done is double-down on the past."
2007-2012
Manager of both the Newspapers Digitisation
and Trove projects
NLA, 1978-2011
Chair of both the Newspapers Digitisation and Trove project boards