Scaling Data

by Ben Stopford

on 8 March 2015


Transcript of Scaling Data

Scaling Data
File System
Reading Data
Key-Value
Mimics a Map/Dictionary over a file
Far easier to work with than a raw file system
Easy to scale linearly. Why?
The foundation of everything
Simple: Pointers directly into physical storage
Simple scalability using multiple files
... or use a distributed filesystem such as HDFS
Limited:
Read access only via pointers
Append only for writes (without additional software)
Map/Dictionary storage naturally shards over many machines
This is known as the Shared-Nothing Architecture
Powers the Web
For scalability K-V Rocks!
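The sharding idea above can be sketched in a few lines. This is a minimal toy model, not any particular product's API; the class and key names are illustrative. Each key is hashed to exactly one shard, which is what makes reads and writes scale linearly as machines are added.

```python
import hashlib

class ShardedStore:
    """Toy shared-nothing key-value store: each key lives on exactly
    one shard, chosen by hashing the key (names are illustrative)."""

    def __init__(self, n_shards=4):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard_for(self, key):
        # Hash the key so data spreads evenly across the shards.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        # Only one shard is ever contacted: this is why K-V access
        # scales linearly as machines (shards) are added.
        return self._shard_for(key).get(key)

store = ShardedStore()
store.put("trade:42", {"qty": 100})
print(store.get("trade:42"))  # the lookup touches a single shard
```

In a real cluster each `dict` would be a separate machine, and consistent hashing would be used so that adding a shard does not remap every key.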
Query
A query can't be routed to 'the right machine'
It can't leverage the 'sharding key'
Client 'knows' to only go to one machine
Key (derivative) query
All other query types
(map-reduce, but may be index driven)
Joins
Joins inevitably mean moving data all over the place :(
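The routing contrast above can be sketched as follows. This is an illustrative toy, not a real client library: a key-derived query goes to exactly one machine, while any other query type must be scattered to every shard and the results gathered, a brute-force cost that grows with the cluster.

```python
shards = [dict() for _ in range(4)]

def put(key, value):
    # Writes use the sharding key to pick one machine.
    shards[hash(key) % len(shards)][key] = value

def get_by_key(shards, key):
    # Key query: the sharding key tells the client exactly which
    # machine to ask -- one hop, regardless of cluster size.
    return shards[hash(key) % len(shards)].get(key)

def query_by_value(shards, predicate):
    # Non-key query: no shard can be ruled out, so we scatter the
    # query to every machine and gather the results (brute force).
    results = []
    for shard in shards:
        results.extend(v for v in shard.values() if predicate(v))
    return results

put("k1", 10)
put("k2", 25)
print(get_by_key(shards, "k1"))                   # touches 1 shard -> 10
print(query_by_value(shards, lambda v: v > 20))   # touches all 4 -> [25]
```

Joins are worse still: they need matching rows from two such scans, which is why they inevitably move data between machines.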
Writing Data

Shared Nothing
More machines => more write throughput
Writes naturally scale as data is sharded
Transactions are limited by cross-machine communication (as joins were)
Relational
Relational databases are AMAZING pieces of software, don't underestimate them!
They cope with a huge number of different concerns
40 Years of academic research behind them
Tradeoffs (Key-Value aside):
Fast writes generally imply slower reads, as data grows
(indexing, columnar storage etc. all add cost to write performance)
NoSQL
Map Reduce
Batch-style processing
Great for operations that cannot be expressed in SQL (image processing or free text analysis)
Very flexible
Bad for query efficiency (brute force approach)
Low level interface to data (bytes)
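The batch style described above can be sketched as a single-process word count. This is only the programming model, with none of the distribution; in a real cluster the map phase runs in parallel over raw files and the shuffle moves pairs between machines.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs. In a real cluster this runs in
    # parallel over raw bytes/files, with no schema or index needed.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick fox", "the lazy dog"]
print(reduce_phase(map_phase(docs)))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Flexibility is the point: `map_phase` could just as easily decode images or tokenise free text, which is exactly what SQL cannot express.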


Key Value Stores
- Often split into K-V/Document/Column/Graph/Caching categories
- Leads to confusion (to my mind) as architecturally they are similar (except graph DBs, which are very different and niche)
CAP (Brewer's CAP Theorem):
It is impossible for a distributed system to provide more than two of: Consistency, Availability and Partition Tolerance
C-A
A-P
Banks like C-A
Internet companies like A-P
So which do you choose?
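One way to make the C-A versus A-P choice concrete is quorum arithmetic, the rule used by Dynamo-style replicated stores: with N replicas, writing to W of them and reading from R, a read is only guaranteed to see the latest write when the two sets must overlap. The function below is a hypothetical helper illustrating that rule, not any product's API.

```python
def is_consistent(n, w, r):
    # With N replicas, a write acknowledged by W nodes and a read
    # served by R nodes are guaranteed to overlap only if R + W > N.
    # Overlap means every read sees the latest write: the 'C' choice.
    return r + w > n

# A bank-style, consistency-leaning configuration:
print(is_consistent(n=3, w=2, r=2))   # True  -- reads see every write
# An internet-style, availability-leaning configuration:
print(is_consistent(n=3, w=1, r=1))   # False -- fast, but reads may be stale
```

The lower W and R are, the fewer machines must respond before answering, so availability and latency improve at the cost of consistency.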
OLTP
Favours write latency over query performance
OLAP
Favours query performance over write latency
For keeping your data safe, you can't beat a relational database.
The relational model is great for data you author
Relations help keep your data meaningful
In RBS all systems of record are Sybase, Oracle or MSSQL
Good for pass-through data
Bad for systems of record (data aging problem)
Can be done relationally with blobs
System of Record
Schema-free
Not having a schema makes it faster to import and manage data, particularly data you did not author.
Prefer a tandem model where the relational database is used for query functionality and the K-V store is used for scale
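The tandem model above can be sketched as a two-part store. This is an illustrative toy with made-up names: a plain dict stands in for the relational database's queryable index, while the sharded side holds the bulky payloads for scale.

```python
class TandemStore:
    """Hypothetical tandem model: a 'relational' index answers the
    rich queries, while a sharded K-V side holds payloads for scale.
    All names are illustrative, not a real product's API."""

    def __init__(self, n_shards=4):
        self.index = {}                        # stands in for the RDBMS
        self.shards = [dict() for _ in range(n_shards)]

    def save(self, key, payload, attributes):
        # Small, queryable metadata goes to the relational side...
        self.index[key] = attributes
        # ...the bulky payload goes to the sharded K-V side.
        self.shards[hash(key) % len(self.shards)][key] = payload

    def find(self, predicate):
        # Query the relational side for matching keys, then fetch
        # each payload from its shard by key -- scalable key access.
        keys = [k for k, a in self.index.items() if predicate(a)]
        return [self.shards[hash(k) % len(self.shards)][k] for k in keys]

store = TandemStore()
store.save("trade:1", b"...big blob...", {"region": "EU"})
print(store.find(lambda a: a["region"] == "EU"))
```

The query side stays small and expressive; only the key-addressed payload traffic needs to scale out.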
Throughput
For throughput above several Gb/s, sharding in the K-V model can't be beaten.
Perfect for large grid use cases
RBS Enterprise Cache (or Coherence)
Coherence
In-memory /bespoke data processing
Continuous queries
Streaming data
ParAccel/IQ
Aggregation over large datasets

Advanced Features
Points to take away:
- No one solution is perfect, but with databases the problems are often of our own making
- Only key-based access, using hashing over a shared-nothing architecture, achieves genuine linear scalability for both reads and writes
- Scaling other query types linearly can ONLY be done through replication (and that impacts writes)
- Providing access to multiple replicas makes consistency much harder
Linear Scalability
Moving beyond this there has been a 'Cambrian explosion' of different solutions
- Any form of replication causes data coherence issues
- These impede write performance in distributed architectures.
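The coherence problem can be shown with a deliberately simplified primary/replica pair. This is a toy model, not any real database: the replica applies writes only when `sync()` runs, so a read that lands on it in between sees stale data, and every write now carries the extra cost of shipping to the replica.

```python
class LaggyReplica:
    """Toy primary/replica pair (illustrative names): the replica
    applies writes only on sync(), so reads in between are stale."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.log = []          # replication log, not yet applied

    def write(self, key, value):
        self.primary[key] = value
        self.log.append((key, value))   # write must also ship to replica

    def read_replica(self, key):
        return self.replica.get(key)    # may lag behind the primary

    def sync(self):
        # Apply the pending log -- the 'replication' step that either
        # slows the write path (if synchronous) or leaves stale reads
        # (if asynchronous, as modelled here).
        for key, value in self.log:
            self.replica[key] = value
        self.log.clear()

db = LaggyReplica()
db.write("price", 101)
print(db.read_replica("price"))  # None -- replica has not caught up
db.sync()
print(db.read_replica("price"))  # 101
```

Making `write` wait for `sync` would restore consistency, but then every write pays the cross-machine cost, which is exactly how replication impedes write performance.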
Replication
The Problem of Distributed Locking when Replicas Exist
Different Patterns at RBS Markets:
Coherence
Read-through caching (Fire/SystemX)
Pre-built cache (ODC/Pheonix)
In-memory processing (Foundry)
OLTP
Many, notably the General Ledger
OLAP
Sales MI datamart (ParAccel)
Schemaless Reporting
SystemX feed reporting (MongoDB)
Relational Database
- When you need to scale, explore the limits of your current technology and use K-V only where you need extra throughput