Introducing

Prezi AI.

Your new presentation assistant.

Refine, enhance, and tailor your content, source relevant images, and edit visuals quicker than ever before.

Scaling up

kent fitch

Updated Dec. 6, 2011

Transcript

Thanks

Presentation Overview,

6 December 2011, Australian Computer Society, Canberra

Scaling Up: The Technology behind the National Library of Australia’s

Newspaper Digitisation and Trove Search Service.

Disclaimer: The views and opinions in this talk are my own. I am not

speaking as a representative of the National Library of Australia. I

currently work at the NLA as a part-time contract programmer. I am

not part of the management structure, I do not work in the "business

area" of the library and I do not make policy decisions.

Themes of this presentation:

The Australian Newspapers Digitisation Program is a collaboration

between the state and territory libraries led by the NLA. The

Australian Newspapers Digitisation Program began in March 2007 and

the Australian Newspapers web site was released as a beta version in

August 2008 as a free online service which enabled full text searching

of newspaper articles as well as tagging, comments and correction of

OCR'd text. http://trove.nla.gov.au/ndp/del/home Because the OCR

quality varies greatly, correction of the text greatly assists search

recall.

Rose Holley has written extensively on the Newspapers digitisation

program (see http://www.google.com.au/search?q=rose+holley+newspapers

and links below).

Providing the capability to correct the OCR text has resonated with

the public. Over 52 million lines have been corrected since July

2008. The public have resources, interests and expertise that dwarf

that of the NLA, and by ceding control of the quality of the OCR to

the public has been very rewarding. The initial fears that either the

public would be uninterested or that the correction, tagging and

commenting facilities would only attract vandals and spammers were

unfounded.

After the initial implementation of the Newspapers delivery service,

and based on the success of the new infrastructure, the NLA started

implementing the Trove single search system, based on the same

infrastructure. http://trove.nla.gov.au/ The standalone Newspapers

service was "folded in" to Trove late in 2010. Although Trove links

to millions of freely available and licensed digital resources, the

Newspapers content dominates activity in Trove (with over 80% of

searches and page views). Trove currently receives 4.7M http "hits"

per day, generates 60GB of outbound responses and about 400,000 page

views to about 40,000 "human" visitors (and a similar page volume to

robots/web-crawlers).

Trove uses the open-source Lucene full text search library, with some

indices using the SOLR front-end. It uses MySQL as a relational

database, and the system is written in Java using both JSP and

Restlets with Freemarker for web page generation.

Trove manages over 250 million records and about 1.1TB of Lucene index

and a 1.2TB of MySQL databases. Newspaper image derivatives of 6

million pages use 70TB of storage, whilst the "master" images of those

pages occupy 250TB.

Trove search service is implemented as a "master" server which

performs updates and replicates copies of indices to (currently) 4

slave servers. Each slave is a "commodity" server with 64GB of

memory, between 8 and 12 Intel Xeon cores, and between 3 and 5

"commodity" Solid State Disks (SSDs).

Because of the index size, the complexity of relevance-ranked queries

and the volume of queries (peaking at over 80 per second), normal

"spinning disk", regardless of configuration, could not economically

meet the IO requirements of Trove. However, SSDs can quickly perform

many thousands of random 4K reads per second, and it is SSDs which

make feasible the approach we've taken with Trove.

Trove handles over 250K record updates per day. Updates to

bibliographic data are processed at a rate of 100 per second; updates

to full text records (newspaper articles, full text journals) at a

rate of 50 per second.

There are many opportunities and demands for growth of the service.

There are journal article sources containing hundreds of millions of

new articles. The NLA maintains a "whole .au domain web harvest" of

approximately 100TB, which is currently a "dark archive". There is

enormous demand and potential for mass digitisation and digital

delivery of other library materials (such as books, manuscripts) and

tens of millions more newspaper pages to process.

To date, hardware capabilities have kept pace with data and search

volumes, but future growth may require different approaches. In the

short term, the just announced Lucene 3.5 and forthcoming Lucene 4

will greatly reduce memory and CPU requirements and will help us

reduce update latency.

Trove exemplifies the value of online access, the "many hands make

light work" principle, and that trust is usually repaid

handsomely(http://www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf

). But Trove is just the first "baby steps" along a challenging path

that library, and more broadly, cultural institutions must travel to

retain relevance in shaping conversations and our society. Digital

delivery offers the prospect of not only better service but also

ultimately cost savings in providing these services. However,

national cultural institutions such as the NLA are operating in very

tight budgetary conditions. Amongst the key challenges for the NLA

and Trove is to make content acquisition cheaper and faster, to create

an infrastructure capable of dealing with the "digital deluge" and to

make the terrifying leap from the physical world of atoms (books) to

the virtual world of bits, to, as Nat Torkington puts it, understand

the implications of the library's reading room really being our

patron's web browser. For many, if it isn’t digitally accessible it

simply doesn’t exist.

Trove is the product of many people working together - many dedicated

staff at the NLA and partner institutions, and thousands of members of

the public. However, Trove has had two champions which have given it

life and shaped its direction: Warwick Cathro and Rose Holley. Warwick

retired earlier this year after 33 years at the NLA. As the Assistant

Director-General for Resource Sharing and Innovation, he was the

champion and driver responsible for both the newspapers and Trove

projects. He created the freedom and "space" for their development

which led to their success. Rose was recruited from the University of

Auckland to be the manager of the newspapers project, and as a natural

and inspirational librarian, innovator and tireless worker and

communicator, set the tone for the project, and subsequently for the

Trove project which she also managed. Rose announced today that she

is about to embark on a new digitisation challenge at the National

Archives of Australia.

Scaling up

The technology behind the NLA's newspaper digitisation and the Trove search service

Presentation to the Australian Computer Society Canberra Branch

6 December 2011

Kent Fitch

National Library of Australia

and Project Computing Pty Ltd

Australian Newspapers

Digitisation Program

INGEST 1

SCAN

INDEX TEXT

"A question of

collaboration"

Peter Macinnis

Ockham's Razor

ABC-RN

27 Nov 2011

‘A wonderful tool - the amount of user

control is very surprising but refreshing.’

'The site you manage is a nightmare! It’s addictive. Keeps me awake at night. Congratulations!'

‘An interesting way of using interested

readers “labour”! I really like it.’

The National Library of Australia ... have a remarkable resource called Trove. This includes a massive collection of digitised newspapers, linked to computer generated text which registered users can correct. More than that, they can add generic tags like 'bushrangers', '39th Battalion' or one of my favourites 'early use of language", the tag I hang on the earliest instances I find of words like 'squatter', 'billy', 'bludger', 'swag' and 'fossick' among others. Those tags are there for all time.

Slowly the resource is growing and the prospects for future researchers are being quietly enhanced, not by hewing and slashing behemoths but by nibbling and gnawing mice. One day I hope people will turn around, raise their eyebrows and ask "Where did all that come from?

A future built on collaboration relies on people who gain a quiet joy from contributing gems, nuggets and crumbs to future generations, whimsical folk who amuse themselves by committing acts of anonymous scholarship.

Goal

GENERATE DERIVATIVES

Improve access to

Australian newspapers

How

a national initiative to

build a online service

that provides free access

to newspapers published

between 1804 and 1954

QA 1

400dpi

greyscale +

bitonal

‘OCR text correction is great! I think

I just found my new hobby!’

‘It’s looking like it will be very cool

and the text fixing and tagging is

quite addictive.’

‘I applaud the capability for readers

to correct the text.’

assign

title
issue date
page #

http://www.abc.net.au/radionational/programs/ockhamsrazor/a-question-of-collaboration/3692142

APPLY DESKEW TO GREYSCALE

QA 2

resolve duplicates
identify missing issues/pages

QA 3

check deskew, zoning, ocr, categorisation

INGEST 2

BATCH

FTP

DESKEW

ZONE

OCR

CATEGORISE

bitonals
title, issue, page metadata

page and article level metadata
raw ocr with word positions
some rekeyed text

Digitising Newspaper Content

Humans...

Trove

40K visits/day

400K page views/day

240K searches/day

750K unique visitors/month

Humans...

Trove

40K visits/day

400K page views/day

240K searches/day

750K unique visitors/month

Newspapers content dominates

80% of searches

("all zone" search is a further 10%)

125K newspaper page or article views/day

Lists

This screen shot has been doctored for this presentation

80% of searches

("all zone" search is a further 10%)

125K newspaper page or article views/day

mostly anonymous, but...

Documents Lucene Index MySQL

Newspapers 59M 320GB 710GB

46,000

logged on at least once

Journal Articles 132M 420GB

half have only logged on once

Humans

4.7M http hits/day

Humans + zombies

Books,photos,... 24M 120GB

520GB

4,300

logged on > 20 times

Website archive 36M 260GB

15K in total

20 users per month

500 actions per month

~55 per sec averaged over the day

900K hits of newspaper image tiles

60GB/day outbound WAN traffic

People 1M 3GB

200

logged on > 500 times

Total 252M 1.1TB 1.2TB

1.8M in total

900 users create 50K each month

33K in total

220 users create 1.5K each month

16K in total

500 users create at least 1 each month

52M lines in total

3200 users correct 2.4M lines per month

120K different articles per month

Newspaper image data

...but 0.6M automatically derived from wikipedia

Of the rest:

1.2M on newspaper articles
13K on everything else...

...and growing

6 M pages

70TB derivatives

250TB masters

jpg tiles - 8 resolutions plus PDF with text

Plus unknown number anonymous users - 12% of lines

greyscale + bitonal TIFF @ 400dpi

LZW compression

greyscale:bitonal size - 30:1

Trove system architecture

Hardware

Software

?SOLR

Lucene?

service wrapper for lucene

web service for

search/update/admin

replication

sharding

schemas

open source

very active development

full text search engine library

written in Java

very easy to use

very fast

clever "segment" design

very scalable

open source

very active development

Single update master - multiple search slaves

Commodity master server with data on SAN

Commodity slave servers with local SSD

Image derivatives on a SAN

Image masters staged (mostly on tape)

Lucene (newspapers) and

SOLR/Lucene (everything else)

MySQL

Linux

Java: Jetty/JSP (newspapers),

Jetty/Restlet/Freemarker (everything else)

haproxy

3 copies for newspapers

Two copies of each index are

distributed across the slaves

..so a slave can fail, and the

system still works

mySQL runs on a separate server

The Trove "UI" JVM is managed

like an index slave

SAN

Newspaper Index

JournalArticle index

Books etc index

Webarchive index

People index

Update Master

Load balancers direct requests to slaves

Slave 1

Slave 2

Slave 4

Slave 3

Newspaper Index

Books etc index

Newspaper Index

Webarchive index

People index

Newspaper Index

JournalArticle index

People index

JounalArticle index

Books etc index

Webarchive index

Slave hardware

"standard"
64GB RAM
12 CPU - 2 x 6core X5670 2.93GHz
mixture of Intel X-25M 160GB and Crucial C300 256GB; 3- 5 drives on each slave

$10-$15K each

huge indices, >> memory

memory cache doesn't help much

I/O madness

Each incoming "user" search

generates 9 lucene index searches

... that's between 80 - 100 searches/sec

during busy hours

Even the simplest query is expanded to a faceting, complex, boolean mess to improve relevance ranking

Each search is "expensive"

Typical busy period

single slave drive:

5K reads/sec

30MB/sec

0.22ms queue+service time

1000 reads: 1 write

field

slave CPU:

q= required: (pattersons curse)

optional:

(text:"pattersons"^0.5 OR text:"curse"^0.5) OR

(title:"pattersons"^4 OR title:"curse"^4) OR

(creator:"pattersons"^3 OR creator:"curse"^3) OR

(subject:"pattersons"^3 OR subject:"curse"^3 ) OR

(s_subject:"pattersons"^1 OR s_subject:"curse"^1 ) OR

(s_title:"pattersons"^1.3 OR s_title:"curse"^1.3 ) OR

(s_creator:"pattersons"^1 OR s_creator:"curse"^1 ) OR

(s_text:"pattersons curse"~2000^0.5 ) OR

(text:"pattersons curse"~2000^3 ) OR

(title:"pattersons curse"~40^12 ) OR

(creator:"pattersons curse"~20^9 ) OR

(subject:"pattersons curse"~2000^9 ) OR

(s_subject:"pattersons curse"~2000^4 ) OR

(exactTitle:"pattersons curse"^25 )

40% user, 5% IO wait, 10 load-avg10

subject:"pattersons curse" ~2000 ^9

slop

Updates per day

score boost

Relevance Scoring is based on "TF IDF"

a function of the number

of times the term

appears in the document

Term Frequency * Inverse Document Frequency

Journal articles

Newspaper articles

Libraries Australia

Pandora webpages

OpenLibrary

Newspaper corrections

Misc harvested content

HathiTrust

Tags/comments/merges

90k

80k

35k

25k

10k

~260K

an (inverse) function of the

number of documents in which

the term appears

... modified by document length and document "boosts"

Update Processing

Newspaper derivatives

deskew, "enhance", tile 8 resolutions, create PDF with text: 20 sec/page elapsed
Typically run between 3 and 6 derivative generation JVMs

Newspaper and journal article indexing

50/sec

Bibliographic data matching and indexing

100/sec

Scale a bit more

Our pain points

Growth ?

Add more index copies (slaves)

Java heap & GC

crazy # of terms..

Faceting

Scale a lot more

100's of millions of articles?

whole domain web harvest ~100TB?

mass book digitisation?

manuscripts digisation?

ongoing newspapaper & magazine digitisation?

expensive, possibly bogus,

possibly not useful

Split ("shard") index,

then replicate shards

Lucene 3.5 and 4 to the rescue...

Update Latency

redesign of in-memory Term Index (3-5x heap reduction)
lucene index doc values
Near Real-Time updates

World

newspapers UI is 4 years old..

UI's age quickly

Touch interface

Lessons Learnt

Hathi mirror? (8.6m volumes, 7k tons, 400TB)

OAIster?

Australia research data?

public e-repository?

ABC archives?

electronic books?

Australia

Canberra

Access is paramount

Online presence is paramount

Many hands make light work

"the public" collectively have resources, interests, expertise which dwarf our own

Quantity over quality

imperfect data still has great value

Ceding control can be very

rewarding

Lucene scales very well

Good relevance ranking is

expensive

SSD is very cost effective

jpeg2000 would have been nice

..one day

Where to from here?

Things

we should have done

better

established a representative test environment

adequately provisioned newspaper image storage

acquired content more cost-effectively

created a critical mass of journal articles

continual improvement on newspaper delivery

recognised the primacy of digital

What are libraries about?

Books?

Bits?

Nathan Torkington

Libraries: Where It All Went Wrong

Like Microsoft, libraries are hoping to stay relevent with "digital bolt-ons"

What are libraries about?

Universal Access

Ideas

Conversations

People

"You’ve added digital after the fact. You probably have special digital groups, probably (hopefully) made up of younger people than the usual library employee.

Congratulations, you just reproduced Microsoft’s strategy: let’s build a few digital bolt-ons for our existing products. Then let’s have some advance R&D guys working on the future while the rest of us get on with it. But think about that for a second. What are the rest of us working on, if those young kids are working on the future? Ah, it must be the past.

So what you’ve effectively done is double-down on the past."

Rose Holley

Librarian

Warwick Cathro

Librarian

2007-2012

Manager of both the Newspapers Digitisation

and Trove projects

NLA, 1978-2011

Chair of both the Newspapers Digitisation and Trove project boards

#

Choose a template

Creativity - Paint

For grant requests, funding pitches, program proposals, or any other kind of education or nonprofit presentation, this Prezi template is the way to generate interest and momentum. Like all Prezi education templates and Prezi nonprofit templates, it’s easily customizable.

Annual Report - Mountain

Every business year has its ups and downs, and you’ll end yours on a high note with this Prezi annual review template featuring a symbolic mountain motif. Like all of our annual review templates, this one can be easily customized with your own topics, images, and data.

Marketing Agency Pitch

Get your work noticed and remembered with an engaging, visually stunning Prezi marketing agency presentation. The simple-yet-clever motif lets you show your ideas in context and—like all Prezi marketing strategy templates—is easily customized to make your own.

See more templates →

Presentations from around the world

Quiz show

Jamie Clark

Increase Assessoria

Nathalia Gabriela

Deber de Anthony Muentes 6to c Apple

UNEIN DEL PACÍFICO

See staff picks →

Learn more about creating dynamic, engaging presentations with Prezi

Why Prezi is better