Not a total experimental failure - an experience report on the Trove architecture
Mark Triggs, Kent Fitch
Scalability
Directions
More data
More integration
Trove is big...
Not big like Google (NLA revenue vs. Google revenue)
Big, like the local bully
Trove
PA(PB)
Each day: 30K visits, 400K pageviews
..."mostly newspapers"
Updates averaged per day:
Gale
Misc OAI sources
5 Lucene indices:
Books etc - 80GB
Articles - 350GB
Newspapers - 280GB
Pandora - 260GB
People - 7GB
2 mySQL dbs:
Trove - 500GB
Newspapers - 600GB
+ 60TB of newspaper image derivatives
transactions - Trove UI
berry
largo
update "master"
slave
6 servers + prod mySQL server
servers are pretty similar
64GB memory
8 on stick and largo
no ssd on berry
~$10-$15K each
SSD makes all the difference when querying a large index
normal disk random read takes 5-10ms
SSD random read takes 0.1ms
normal disk costs 10 cents/GB
fancy disk costs x10 more
SSD costs $1.80/GB
still, just ~15% of server cost
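As a rough check on the figures above, the sketch below works out the random-read speedup and what it might cost to hold the five Lucene indices on SSD. The talk does not say how the ~15% figure was derived, so this rests on an assumption: that a single server holds one full copy of all five listed indices on SSD. The variable and function names are illustrative only.

```python
# Figures quoted on the slides.
DISK_READ_MS = (5.0, 10.0)      # normal disk random read, low and high estimate
SSD_READ_MS = 0.1               # SSD random read
SSD_COST_PER_GB = 1.80          # dollars per GB
SERVER_COST = (10_000, 15_000)  # dollars per server, low and high estimate

# Sizes in GB of the five Lucene indices, as listed on the slides.
INDEX_GB = {"Books etc": 80, "Articles": 350, "Newspapers": 280,
            "Pandora": 260, "People": 7}


def speedup_range():
    """How many times faster an SSD random read is than a disk random read."""
    return tuple(round(ms / SSD_READ_MS) for ms in DISK_READ_MS)


def ssd_cost(total_gb):
    """Dollar cost of `total_gb` of SSD at the quoted price per GB."""
    return total_gb * SSD_COST_PER_GB


# ASSUMPTION (not from the talk): one server holds one copy of every index.
total_gb = sum(INDEX_GB.values())
cost = ssd_cost(total_gb)
shares = tuple(cost / server for server in SERVER_COST)

low, high = speedup_range()
print(f"SSD random reads are roughly {low}x to {high}x faster")
print(f"{total_gb}GB of SSD costs about ${cost:,.0f}")
print(f"that is {shares[1]:.0%} to {shares[0]:.0%} of the server price")
```

Under that assumption the SSD share lands in the same region as the ~15% quoted on the slide, which is why the price per GB matters much less than the latency difference.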
[Diagram labels: index master, updates; SAN (disk) index; index slave with local SSD index (shown twice); queries, responses]
got a new index for me? A: take this...
write index
2 copies of each index are distributed across the slaves (3 copies for newspapers)
A slave can fail, and the system still works
Load balancers direct requests to slaves
The Trove "UI" JVM is managed like an index slave
take it away, Mark...
Scale a bit more
Add more index copies (slaves)
Split ("shard") index,
then replicate shards
Scale a lot more
House of cards?
DB servers
front end dispatcher
general mayhem
Single points of failure...
a bit complicated...
...vast amounts of data compared to LA
NLA Harvester CBS pusher NCM Trove UI Pandas
Growth
100's of millions of articles?
whole domain web harvest ~100TB?
mass book digitisation?
ongoing newspaper & magazine digitisation?
AustLit?
! don't mention OAIster !
Hathi mirror? (8.6m volumes, 7k tons, 400TB)
electronic books?
http://www.austlit.edu.au:7777/mockups/trove/home.html
"We don't need no stinkin' hardware maintenance"
Overhead of "external" load balancing?
What is Trove?
Can we shard and
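
The replication scheme described earlier in the talk (2 copies of each index distributed across the slaves, 3 for newspapers, with load balancers directing requests to slaves) and the "split, then replicate shards" idea can be sketched roughly as follows. This is an illustration only, not the Trove implementation: the slave names, the round-robin placement, the shard count, and the `place_replicas`, `route` and `shard_names` helpers are all hypothetical.

```python
from itertools import cycle

# Copies per index as stated in the talk: 2 each, 3 for newspapers.
COPIES = {"books": 2, "articles": 2, "newspapers": 3, "pandora": 2, "people": 2}


def place_replicas(copies, slaves):
    """Assign each index to n distinct slaves, rotating round-robin."""
    placement = {}
    ring = cycle(slaves)
    for index, n in copies.items():
        if n > len(slaves):
            raise ValueError(f"not enough slaves for {n} copies of {index}")
        chosen = []
        while len(chosen) < n:
            slave = next(ring)
            if slave not in chosen:
                chosen.append(slave)
        placement[index] = chosen
    return placement


def route(index, placement, down=frozenset()):
    """Stand-in for a load balancer: first live slave holding the index."""
    for slave in placement[index]:
        if slave not in down:
            return slave
    raise RuntimeError(f"no live copy of {index}")


def shard_names(index, n_shards):
    """Names for the pieces of an index after it has been split."""
    return [f"{index}-{i}" for i in range(n_shards)]


slaves = ["slave1", "slave2", "slave3", "slave4"]  # hypothetical names
placement = place_replicas(COPIES, slaves)

# A single slave can fail and every index is still served by another copy.
for failed in slaves:
    for index in COPIES:
        assert route(index, placement, down={failed}) != failed

# "Split, then replicate shards": each shard is placed like a small index.
# The shard count of 2 is an arbitrary choice for illustration.
sharded = {name: COPIES["newspapers"] for name in shard_names("newspapers", 2)}
shard_placement = place_replicas(sharded, slaves)
print(placement)
print(shard_placement)
```

Treating each shard as an index in its own right keeps the placement and routing logic unchanged. The cost is that a query against a sharded index has to be sent to one copy of every shard and the results merged, which this sketch leaves out.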