Apache Cassandra

by Kamil Szymański
on 27 November 2012


Transcript of Apache Cassandra

Polyglot Programming
http://memeagora.blogspot.com/2006/12/polyglot-programming.html

Pretty much any computer you buy has multiple processors in it, so we're going to have to get better at writing threading code. Yet, as anyone who has read Java Concurrency in Practice by Brian Goetz (an exceptional book, by the way) can attest, writing good multi-threading code is hard. Very hard. So why bother? Why not use a language that handles multiple threads more gracefully? Like a functional language? Functional languages eliminate side effects on variables, making it easier to write thread-safe code. It's all about choosing the right tool for the job and leveraging it correctly.
Neal Ford, December 2006

Different languages are suitable for tackling different problems. Complex applications combine different types of problems, so picking the right language for the job may be more productive than trying to fit all aspects into a single language.

Polyglot Persistence
http://martinfowler.com/bliki/PolyglotPersistence.html
http://martinfowler.com/bliki/images/polyglotPersistence/polyglot.png

Different kinds of data with different access patterns and different requirements. Scaling to lots of traffic gets harder and harder to do with vertical scaling.

I'm confident to say that if you're starting a new strategic enterprise application you should no longer be assuming that your persistence should be relational. The relational option might be the right one - but you should seriously look at other alternatives. So polyglot persistence will come at a cost - but it will come because the benefits are worth it.
Martin Fowler, November 2011

NoSQL
Document Oriented: CouchDB, MongoDB
Graph Oriented: Neo4j, FlockDB
Key-Value Stores: Redis, Voldemort
Column Based: Cassandra, HBase

Apache Cassandra is a free, open-source, highly scalable, distributed database system for managing large amounts of data.

RDBMS disadvantages
hard to scale
ineffective in handling large volumes of data
lots of IO
object-relational impedance mismatch
painful data model changes

Project triangle

CAP Theorem
Availability: each client can always read and write
Consistency: all clients always have the same view of data
Partition tolerance: the system works despite physical network partitions

Apache Cassandra
scalability, performance and fault tolerance in distributed databases
written in Java
P2P architecture - no single point of failure, all nodes are the same

Column family
container for columns and rows
each row is uniquely identified by its row key
a row consists of a row key and a set of columns
each row can have a different set of columns - no fixed schema
rows are sorted by row key

Column
the smallest increment of data
a tuple containing a name, a value and a timestamp
columns are sorted by column name
the timestamp is used to determine the most recent update to a column
the timestamp is provided by the client
the latest timestamp always wins when requesting data

Mutating data
one operation for all data modifications (inserts, updates, deletes)
latest timestamp wins
mutations are idempotent

Cluster
the total amount of data managed by the cluster is represented as a ring
the ring is divided into ranges equal to the number of nodes
before a node can join the ring, it must be assigned a token
the token value determines the node's position in the ring and its range of data
data is partitioned across the nodes based on the row key

Fault tolerance
failures will happen - handle failure as a normal part of operations
data usage is gossiped between nodes

Replication
the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance
Cassandra stores copies (replicas) of each row based on the row key
the number of replicas (replication factor) is set at keyspace level
all replicas are equally important

Replication strategy
SimpleStrategy - by default replicas are placed on the next nodes clockwise in the ring
NetworkTopologyStrategy - allows you to specify how many replicas you want in each data center
within data centers replicas are placed on different racks (failures within a data center tend to be highly correlated)
relies on a snitch to place replicas correctly across data centers and racks

Snitches
RackInferringSnitch - assumes the topology of the network from the octets of the node's IP address
EC2Snitch - for deployments on Amazon EC2 where all nodes are within a single region
EC2MultiRegionSnitch - for deployments on Amazon EC2 where the cluster spans multiple regions
PropertyFileSnitch - uses a user-defined description of the network details, for example:

# Data Center One
175.56.12.105=DC1:RAC1
175.50.13.200=DC1:RAC1
120.57.102.103=DC1:RAC2

# Data Center Two
110.56.12.120=DC2:RAC1
50.45.14.220=DC2:RAC2
50.17.10.203=DC2:RAC3

# Analytics Replication Group
172.106.12.120=DC3:RAC1

# default for unknown nodes
default=DC3:RAC1

Gossip protocol
a peer-to-peer inter-node communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about
a gossip message has a version associated with it, so that during a gossip exchange older information is overwritten with the most current state for a particular node
instead of a fixed threshold for marking nodes without a heartbeat as down, an accrual failure detection mechanism calculates a per-node threshold that takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate
running nodes periodically try to initiate gossip contact with failed nodes to see if they are back up

Hinted Handoff
read/write anywhere architecture
if a replica is known to be down or fails to acknowledge a write, the coordinator will store a hint for it
a hint consists of the target replica and the mutation to be replayed
by default hints are only saved for one hour after a replica fails
when a replica that is storing hints detects that the failed node is alive again, it begins streaming the missed writes

Anti-Entropy Node Repair
ensures that all data on a replica is made consistent

Read Repair
in the background, read repair requests are sent to any additional replicas that did not receive a direct request
the coordinator compares the data from all the remaining replicas that own the row and, if they are inconsistent, issues writes to the out-of-date replicas to update the row to the most recently written values
read repair probability is configured per column family
deleted items are tombstoned (marked for deletion)

Merkle Tree
a data structure containing a tree of summary information (a tree of data block hashes) about a larger piece of data, used to verify its contents
used to detect inconsistencies between replicas while minimizing the amount of transferred data
Merkle Trees are built per column family
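
To make the repair exchange concrete, here is a minimal Java sketch of a Merkle tree, assuming MD5 block hashes and a plain list-of-levels representation; the class and method names are invented for illustration and this is not Cassandra's implementation.

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of a Merkle tree over fixed data blocks.
// Two replicas build the same tree shape over the same key ranges;
// comparing hashes pinpoints the blocks that differ.
public class MerkleSketch {

    static byte[] hash(byte[]... parts) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (byte[] p : parts) {
            md5.update(p);
        }
        return md5.digest();
    }

    // Builds the tree bottom-up; levels.get(0) holds the leaf hashes, the last level is the root.
    static List<byte[][]> build(List<byte[]> blocks) throws Exception {
        List<byte[][]> levels = new ArrayList<>();
        byte[][] level = new byte[blocks.size()][];
        for (int i = 0; i < blocks.size(); i++) {
            level[i] = hash(blocks.get(i));
        }
        levels.add(level);
        while (level.length > 1) {
            byte[][] parent = new byte[(level.length + 1) / 2][];
            for (int i = 0; i < parent.length; i++) {
                byte[] left = level[2 * i];
                byte[] right = (2 * i + 1 < level.length) ? level[2 * i + 1] : left;
                parent[i] = hash(left, right);
            }
            levels.add(parent);
            level = parent;
        }
        return levels;
    }

    // Compares only leaf hashes here; a real exchange would walk the tree
    // top-down and stop early wherever subtree hashes already match.
    static List<Integer> diffLeaves(List<byte[][]> a, List<byte[][]> b) {
        List<Integer> out = new ArrayList<>();
        byte[][] la = a.get(0);
        byte[][] lb = b.get(0);
        for (int i = 0; i < Math.min(la.length, lb.length); i++) {
            if (!Arrays.equals(la[i], lb[i])) {
                out.add(i);
            }
        }
        return out;
    }
}

Only the hashes need to cross the network during a repair; full rows are streamed just for the ranges whose hashes disagree, which is how the amount of transferred data stays small.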

Performance
clients can connect to any node in the cluster, and that node (the coordinator) will route the request to the right replicas
most performance gains come from IO reduction

Writes
write to a commit log, then write to an in-memory table structure (memtable)
writes are batched in memory and periodically written to disk to a persistent table structure (SSTable)
memtables and SSTables are maintained per column family
memtables are organized in sorted order by row key
SSTables are immutable
no read before write
no data overwriting = no race conditions
writes are idempotent
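
As an illustration of that write path, here is a simplified sketch, assuming a tab-separated commit log file and a size-based flush trigger; the class and file names are invented for the example and this is not Cassandra's code.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Simplified write path: commit log append -> memtable insert -> periodic flush to an SSTable-like file.
public class WritePathSketch {
    private final Path dir;
    private final BufferedWriter commitLog;
    // Sorted by row key, like a memtable.
    private final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();
    private final int flushThreshold;
    private int sstableCount = 0;

    public WritePathSketch(Path dir, int flushThreshold) throws IOException {
        this.dir = dir;
        this.flushThreshold = flushThreshold;
        this.commitLog = Files.newBufferedWriter(dir.resolve("commitlog.txt"));
    }

    // One operation for all mutations; the latest write for a key simply wins.
    public synchronized void mutate(String rowKey, String value) throws IOException {
        commitLog.write(rowKey + "\t" + value + "\n"); // durability first
        commitLog.flush();
        memtable.put(rowKey, value);                   // no read before write
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    // Writes the memtable to an immutable, sorted file, like an SSTable flush.
    private void flush() throws IOException {
        Path sstable = dir.resolve("sstable-" + (sstableCount++) + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(sstable)) {
            for (Map.Entry<String, String> e : new TreeMap<>(memtable).entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue() + "\n");
            }
        }
        memtable.clear(); // a real system would also recycle the commit log segment
    }
}

Because every mutation is an append plus an in-memory put, nothing is read or overwritten in place, which is what makes writes cheap and idempotent.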

Reads
rows are stored in sorted order
a row must be combined from all SSTables that contain columns from the row in question, as well as from the unflushed memtable
each SSTable has a Bloom filter associated with it that checks whether a requested row key exists in the SSTable before doing any disk seeks
false positives are possible, but false negatives are not
key cache for rows that are accessed frequently
optional row cache

Bloom filters
space-efficient probabilistic data structure, trading precision for speed
several hashes are generated per key and the corresponding buckets are marked for each inserted key
to check a key, look at each of its buckets - if any is empty, the key was never inserted
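
A minimal sketch of that bucket-marking idea, assuming a fixed-size bit set and simple double hashing; the sizes and hash mixing are arbitrary choices for the example, not Cassandra's implementation.

import java.util.BitSet;

// Tiny Bloom filter: k hash functions mark k buckets per key.
// The membership test can return false positives, never false negatives.
public class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    private int bucket(String key, int i) {
        // Derive k different hash values from two base hashes (double hashing).
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(bucket(key, i));
        }
    }

    // If any bucket is empty the key was definitely never inserted.
    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(bucket(key, i))) {
                return false;
            }
        }
        return true;
    }
}

Before touching disk, the read path can ask each SSTable's filter mightContain(rowKey) and skip the file on a definite no; a false positive only costs a wasted seek, never a wrong answer.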

Indexes
the primary index for a column family is the index of its row keys
optional secondary indexes on column values
by default 1 row key out of every 128 is sampled
index efficiency vs. memory usage

Compaction
a background process that combines data from SSTables
merges row fragments together, discards tombstones and rebuilds primary and secondary indexes
once a newly merged SSTable is complete, the input SSTables are marked as obsolete and eventually deleted
temporary spike in disk space usage and disk I/O
fewer SSTable files on disk need to be checked in order to complete a read request

Partitioning
the partitioner determines which node to store the data on
RandomPartitioner - consistent hashing over the MD5 hash value of the row key, even data distribution
ByteOrderedPartitioner - orders rows lexically by key bytes, efficient range scans over rows, administrative overhead to load balance the cluster

Tunable consistency
data availability and response time vs. data consistency
for any given read or write operation, the client application decides how consistent the requested data should be
consistency is ensured when: (nodes_written + nodes_read) > replication_factor

Write consistency levels
specify how many replicas the write must succeed on before an acknowledgement is returned to the client application
writes are always sent to all replicas for the specified row regardless of the consistency level
ANY (hinted handoff)
ONE
TWO
THREE
QUORUM
LOCAL_QUORUM
EACH_QUORUM
ALL

Read consistency levels
specify how many replicas must respond before a result is returned to the client application
ONE
TWO
THREE
QUORUM
LOCAL_QUORUM
EACH_QUORUM
ALL

quorum is calculated as (replication_factor / 2) + 1, rounded down to a whole number (e.g. with a replication factor of 3, quorum is 2)

Data driven consistency
use thresholds to manage consistency - cash withdrawal example:

amount               consistency level
< 1 000 $            ONE
1 000 $ - 10 000 $   LOCAL_QUORUM
> 10 000 $           EACH_QUORUM
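
A sketch of how a client might apply those thresholds, assuming an invented ConsistencyLevel enum and the threshold values from the table above; this is an illustration, not a real client API.

import java.math.BigDecimal;

// Data driven consistency: pick a consistency level per operation
// based on how much an error would cost (thresholds from the example above).
public class WithdrawalConsistency {

    enum ConsistencyLevel { ONE, LOCAL_QUORUM, EACH_QUORUM }

    static ConsistencyLevel levelFor(BigDecimal amount) {
        if (amount.compareTo(new BigDecimal("1000")) < 0) {
            return ConsistencyLevel.ONE;          // cheap, fast, eventually consistent
        } else if (amount.compareTo(new BigDecimal("10000")) <= 0) {
            return ConsistencyLevel.LOCAL_QUORUM; // quorum within the local data center
        } else {
            return ConsistencyLevel.EACH_QUORUM;  // quorum in every data center
        }
    }

    // With replication_factor = 3, a QUORUM write (2 replicas) plus a QUORUM read (2 replicas)
    // satisfies (nodes_written + nodes_read) > replication_factor, i.e. 2 + 2 > 3.
    public static void main(String[] args) {
        System.out.println(levelFor(new BigDecimal("250")));    // ONE
        System.out.println(levelFor(new BigDecimal("5000")));   // LOCAL_QUORUM
        System.out.println(levelFor(new BigDecimal("50000")));  // EACH_QUORUM
    }
}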

Data consistency
attempting full ACID compliance in distributed systems is a bad idea (and actually impossible in the strictest sense)
relaxing data consistency != data corruption
single row updates are atomic, everything else is not

Scalability
linear horizontal scalability on commodity hardware
Why isn't the system fast enough?
Was it fast enough with only one user?
Then the challenge is scalability, not performance.
denormalize data

Adding nodes to the cluster
organizing data as a logical ring greatly simplifies adding and removing nodes
calculate the tokens for new nodes based on the expansion strategy
the minimum possible set of data is affected
data is partitioned by row key and SSTables are sorted by row key
RandomPartitioner uses consistent hashing
does not require reconfiguration
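
A sketch of the token ring idea, assuming RandomPartitioner-style MD5 tokens and a sorted map from token to node; the Ring class is invented for illustration, not Cassandra's code.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hashing sketch: each node owns the range ending at its token;
// a row key is hashed to a token and stored on the first node at or after that token.
public class Ring {
    private final TreeMap<BigInteger, String> tokenToNode = new TreeMap<>();

    public void addNode(String node, BigInteger token) {
        // Only keys between the previous token and this one move to the new node.
        tokenToNode.put(token, node);
    }

    public String nodeFor(String rowKey) {
        BigInteger token = token(rowKey);
        SortedMap<BigInteger, String> tail = tokenToNode.tailMap(token);
        // Wrap around the ring if the token is past the last node.
        return tail.isEmpty() ? tokenToNode.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // RandomPartitioner-style token: MD5 hash of the row key, read as an unsigned integer.
    static BigInteger token(String rowKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}

Because a key's token is fixed, adding a node with a new token only moves the keys that fall into its range; the rest of the ring is untouched.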

Monitoring
log4j
nodetool
JMX
Datastax OpsCenter
ring ownership is continuously gossiped between nodes

Bigtable + Dynamo
Cassandra takes its data model from Google's Bigtable and its architecture from Amazon's Dynamo
Bigtable: A Distributed Storage System for Structured Data - http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: Amazon's Highly Available Key-value Store - http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf

Key thoughts
storage is cheap
ACID semantics are not needed for all use cases
memory access << network access < local disk access

NoSQL
benefits:
higher performance
higher scalability
flexible data model
less administrative overhead

drawbacks:
limited transactions
relaxed consistency
unconstrained data
limited ad-hoc query capabilities
different data models
different consistency/transaction models
different APIs

Cache
the key cache holds the location of keys in memory on a per-column-family basis
the row cache holds the entire contents of the row in memory

Staged Event-Driven Architecture (SEDA)
a software architecture that decomposes a complex application into a series of stages connected with queues
an operation starts at one stage, which passes it asynchronously to another stage, which passes it to another, and so on until it completes
each stage is associated with a thread pool, and that thread pool executes the operation when convenient (determined by resource availability)
higher level of concurrency, better resource management
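
A minimal sketch of the staged idea, assuming two invented stages (parse and write) wired together with queues and per-stage thread pools; it illustrates SEDA in general rather than Cassandra's actual stage layout.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

// Staged Event-Driven Architecture: each stage has its own queue and thread pool,
// and hands work to the next stage asynchronously instead of calling it directly.
public class SedaSketch {

    static class Stage {
        final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        final ExecutorService pool;

        Stage(String name, int threads, Stage next, UnaryOperator<String> work) {
            this.pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        try {
                            String event = inbox.take();       // wait for work
                            String result = work.apply(event); // do this stage's part
                            if (next != null) {
                                next.inbox.put(result);        // pass on asynchronously
                            } else {
                                System.out.println(name + " finished: " + result);
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Stage write = new Stage("write", 2, null, s -> s + " -> written");
        Stage parse = new Stage("parse", 4, write, s -> s + " -> parsed");
        parse.inbox.put("mutation-1");
        parse.inbox.put("mutation-2");
        Thread.sleep(200);   // let the stages drain for the demo
        System.exit(0);      // stop the demo's non-daemon pool threads
    }
}

Each stage can size its pool independently, which is where the better resource management comes from.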

Fallacies of distributed computing
network is reliable
latency is zero
bandwidth is infinite
network is secure
topology doesn't change
there is one administrator
transport cost is zero
network is homogeneous

Load balancing
read requests are served by the replica node closest to the coordinator node that received the request, unless the dynamic snitch determines that the node is performing poorly and routes the request elsewhere
the dynamic snitch layer routes requests away from poorly-performing nodes
load balancing by replication

Use the right tool for the job
if all you have is a hammer, everything looks like a nail
there is no golden hammer

Questions?
Thank you

Kamil Szymański
kamil.szymanski.dev@gmail.com
Warszawa JUG #103, 27 November 2012