Scaling Data


Ben Stopford, 8 March 2015

Transcript of Scaling Data

Scaling Data
File System
Reading Data
Mimics Map/Dictionary over a file
Far easier to work with than a raw file system
Easy to scale linearly. Why?
The foundation of everything
Simple: Pointers directly into physical storage
Simple scalability using multiple files
... or use a distributed filesystem such as HDFS
Read access only via pointers
Append only for writes (without additional software)
Map/Dictionary storage naturally shards over many machines
This is known as the Shared-Nothing Architecture
Powers the Web
For scalability K-V Rocks!
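The shared-nothing idea can be sketched in a few lines (node names are invented for the illustration): hashing the key alone determines which machine owns a value, so no machine needs to coordinate with, or even know about, the others.

```python
# Minimal sketch of hash sharding over a shared-nothing cluster.
# NODES is a hypothetical cluster; any stable hash works.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner(key: str) -> str:
    # A stable hash of the key picks exactly one shard, every time.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]
```

Because the mapping is a pure function of the key, adding clients or data never requires cross-machine lookups for key access.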
Key (derivative) queries: the client 'knows' to go to only one machine
All other query types: the query can't be routed to 'the right machine' as it can't leverage the 'sharding key' (map-reduce style, though it may be index driven)
Joins inevitably mean moving data all over the place :(
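The routing difference above can be shown with a toy shard API (the `shards`/`put`/`get` names are invented for the sketch): a key lookup touches exactly one machine, while any other query must be scattered to every shard and the results gathered back.

```python
# Three in-process dicts stand in for three shard machines.
shards = [dict() for _ in range(3)]

def shard_for(key):
    # The sharding key routes to exactly one shard.
    return shards[hash(key) % len(shards)]

def put(key, value):
    shard_for(key)[key] = value

def get(key):
    # Routed: one machine answers.
    return shard_for(key).get(key)

def query(predicate):
    # Scatter-gather: every shard is scanned, results merged.
    return [v for s in shards for v in s.values() if predicate(v)]
```

As the cluster grows, `get` stays O(1) machines while `query` always costs O(all machines) — which is why only key access scales linearly.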

More machines => more write throughput
Writes naturally scale as data is sharded
Transactions are limited by cross-machine communication (as joins were)
Relational databases are AMAZING pieces of software, don't underestimate them!
They cope with a huge number of different concerns
40 Years of academic research behind them
(Key-Value aside)
Fast reads generally imply slower writes, as data grows

(indexing, columnar storage etc. all add cost to write performance)
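The write-cost of read optimisation can be made concrete with a toy store carrying one secondary index (the field and structure names are invented): every insert now does index upkeep on top of the write itself, and each further index adds more.

```python
# Primary store plus one secondary index to speed city lookups.
primary = {}   # user_id -> record
by_city = {}   # city -> set of user_ids (read-optimising structure)

def insert(user_id, record):
    primary[user_id] = record                               # the write itself
    by_city.setdefault(record["city"], set()).add(user_id)  # index upkeep

def users_in(city):
    # Fast read, paid for at write time.
    return by_city.get(city, set())
```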
Map Reduce
Batch-style processing
Great for operations that cannot be expressed in SQL (image processing or free text analysis)
Very flexible
Bad for query efficiency (brute force approach)
Low level interface to data (bytes)
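The map-reduce style can be sketched with the classic word count — 'map' emits (word, 1) pairs, a shuffle groups them by key, and 'reduce' sums each group. This is an in-process illustration of the programming model, not a distributed runtime.

```python
from collections import defaultdict

def map_phase(docs):
    # Brute force over the raw input: emit a pair per word occurrence.
    for doc in docs:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle (group by key) and reduce (sum) collapsed into one step.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(map_phase(["a b a", "b c"]))
```

Note the flexibility and the cost in one picture: the map function can do anything (image processing, free-text analysis), but every query re-reads all the data.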

Key Value Stores
- Often split into K-V/Document/Column/Graph/Caching categories
- Leads to confusion (to my mind) as architecturally they are similar (except graph DBs which are very different and niche)
CAP (Brewer's CAP Theorem)
It is impossible for a distributed system to provide more than two of: Consistency, Availability and Partition tolerance
Banks like C-A
Internet companies like A-P
So which do you choose?
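A hedged illustration of that choice: when a network partition cuts a node off, it must either refuse to answer (favouring consistency) or serve its possibly stale local copy (favouring availability). The function and mode names here are invented for the sketch.

```python
def read(local_value, partitioned: bool, prefer: str):
    # prefer="CP": refuse rather than risk returning stale data.
    # prefer="AP": stay available, accept possible staleness.
    if partitioned and prefer == "CP":
        raise RuntimeError("unavailable during partition")
    return local_value
```

Neither branch escapes the theorem; picking one is the design decision.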
Favours query performance over write latency
Favours write latency over query performance
For keeping your data safe, you can't beat a relational database.
The relational model is great for data you author
Relations help keep your data meaningful
In RBS all systems of record are Sybase, Oracle or MSSQL
Good for pass-through data
Bad for systems of record (data-ageing problem)
Can be done relationally with blobs
System of Record
Not having a schema makes it faster to import and manage data, particularly data you did not author.
Prefer a tandem model where the relational database is used for query functionality and the K-V store is used for scale
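One common shape for that tandem model is a read-through lookup (the `kv`/`db` clients here are hypothetical): key access hits the sharded K-V tier for throughput, and misses fall through to the relational database, which stays the system of record.

```python
def fetch(trade_id, kv, db):
    # Fast path: the sharded K-V tier absorbs the read throughput.
    value = kv.get(trade_id)
    if value is None:
        # Miss: the relational system of record is authoritative.
        value = db.query_by_id(trade_id)
        if value is not None:
            kv[trade_id] = value   # populate the K-V tier for next time
    return value
```

Complex, relational queries bypass this path entirely and go straight to the database, so each store does only what it is good at.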
For throughput above several Gb/s sharding in the K-V model can't be beaten.
Perfect for large grid use cases
RBS Enterprise Cache (or Coherence)
In-memory /bespoke data processing
Continuous queries
Streaming data
Aggregation over large datasets

Advanced Features
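A continuous query over streaming data can be sketched as an aggregate that is updated incrementally as each event arrives, rather than re-scanned in batch (the event fields below are invented for the illustration).

```python
from collections import defaultdict

totals = defaultdict(float)   # running aggregate, always current

def on_trade(event):
    # Called once per streamed event; no batch re-scan needed.
    totals[event["book"]] += event["notional"]

for e in [{"book": "rates", "notional": 10.0},
          {"book": "fx", "notional": 5.0},
          {"book": "rates", "notional": 2.5}]:
    on_trade(e)
```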
Points to take away:
- No one solution is perfect, but with databases the problems are often of our own making
- Only key-based access, using hashing over a shared-nothing architecture, will scale linearly for both reads and writes
- Scaling other query types can ONLY be done through replication (and that impacts writes)
- Providing access to multiple replicas makes consistency much harder
Linear Scalability
Moving beyond this there has been a 'Cambrian explosion' of different solutions
- Any form of replication causes data coherence issues
- These impede write performance in distributed architectures.
The Problem of Distributed Locking when Replicas Exist
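The coherence problem can be shown in a few lines: once a key has replicas, a writer must coordinate with every copy (the effect of a distributed lock) before readers anywhere can be allowed to see the new value — which is exactly where the write cost comes from.

```python
# Three in-process dicts stand in for three replica machines.
replicas = [dict() for _ in range(3)]

def write(key, value):
    # Synchronous replication: update every copy before returning.
    # Coherent, but write latency grows with the replica count.
    for r in replicas:
        r[key] = value

def read(key, i):
    # Any replica may serve the read once the write completes.
    return replicas[i].get(key)
```

Drop the coordination and reads scale beautifully, but two clients can then observe different values for the same key — the consistency problem from the takeaways above.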
Different Patterns at RBS Markets:
Read-through caching (Fire/SystemX)
Pre-built cache (ODC/Pheonix)
In-memory processing (Foundry)
Many, notably the General Ledger
Sales MI datamart (ParAccel)
Schemaless Reporting
SystemX feed reporting (MongoDB)
- When you need to scale, explore the limits of your current technology and use K-V only where you need extra throughput