Transcript of Apache Cassandra
cassandra-topology.properties fragment:

# Data Center Two
# Analytics Replication Group
# default for unknown nodes
default=DC3:RAC1

Gossip
- a peer-to-peer inter-node communication protocol in which nodes periodically exchange state information about themselves and about the other nodes they know about
- every gossip message has a version associated with it, so that during a gossip exchange older information is overwritten with the most current state for a particular node
- instead of a fixed threshold for marking nodes without a heartbeat as down, an accrual failure detection mechanism calculates a per-node threshold that takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate
- running nodes periodically try to initiate gossip contact with failed nodes to see if they are back up

Hinted Handoff
- if a replica is known to be down, or fails to acknowledge a write, the coordinator stores a hint for it
- a hint consists of the target replica and the mutation to be replayed
- by default, hints are only saved for one hour after a replica fails
- when a replica that is storing hints detects that the failed node is alive again, it begins streaming the missed writes
- enables the read/write-anywhere architecture

Anti-Entropy Node Repair
- ensures that all data on a replica is made consistent
- uses Merkle Trees: a Merkle Tree is a data structure containing a tree of summary information (a tree of data-block hashes) about a larger piece of data, used to verify its contents
- Merkle Trees minimize the amount of data transferred to detect inconsistencies between replicas, and are built per column family

Read Repair
- background read repair requests are sent to any additional replicas that did not receive a direct request
- the coordinator compares the data from all the remaining replicas that own the row in the background and, if they are inconsistent, issues writes to the out-of-date replicas to update the row to the most recently written values
- read repair probability is configured per column family
- deleted items are tombstoned (marked for deletion)

Performance
- clients can connect to any node in the cluster, and that node (the coordinator) routes the request to the right replicas

Writes
- write to a commit log, then write to an in-memory table structure (memtable)
- writes are batched in memory and periodically written to disk to a persistent table structure (SSTable)
- memtables and SSTables are maintained per column family
- memtables are organized in sorted order by row key; SSTables are immutable
- no read before write and no data overwriting = no race conditions; writes are idempotent

Reads
- each SSTable has a Bloom filter associated with it that checks whether a requested row key exists in that SSTable before doing any disk seeks; false positives are possible, but false negatives are not
- a row must be combined from all SSTables that contain columns of the row in question, as well as from the unflushed memtable
- rows are stored in sorted order
- key cache for frequently accessed rows; optional row cache

Indexes
- the primary index for a column family is the index of its row keys
- optional secondary indexes on column values
- by default, 1 row key out of every 128 is sampled for the primary index (index efficiency vs. memory usage)

Compaction
- a background process that merges row fragments together, discards tombstones, and rebuilds primary and secondary indexes
- once a newly merged SSTable is complete, the input SSTables are marked as obsolete and eventually deleted
- causes a temporary spike in disk space usage and disk I/O
- afterwards there are fewer SSTable files on disk that need to be checked in order to complete a read request

Partitioning
- the partitioner determines which node to store the data on
- ByteOrderedPartitioner: orders rows lexically by key bytes; efficient range scans over rows, but administrative overhead to load balance the cluster
- RandomPartitioner
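As a sketch of how a partitioner places rows, here is a toy consistent-hashing ring in Python (the node names and token spacing are my own illustration, not Cassandra code): each node owns a token on the ring, and a row belongs to the first node whose token is at or past the MD5 hash of the row key, wrapping around at the end.

```python
import bisect
import hashlib

def md5_token(row_key: str) -> int:
    # 128-bit MD5 value of the row key, used as the key's position on the ring.
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, node_tokens: dict):
        # node_tokens: node name -> token (its position on the ring)
        self.ring = sorted((token, node) for node, token in node_tokens.items())

    def node_for(self, row_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around.
        token = md5_token(row_key)
        i = bisect.bisect_left(self.ring, (token, ""))
        if i == len(self.ring):
            i = 0  # keys past the last token wrap to the first node
        return self.ring[i][1]

# Four nodes with evenly spaced tokens give roughly even data distribution.
ring = Ring({f"node{i}": i * (2 ** 128 // 4) for i in range(4)})
```

The same key always hashes to the same token, so placement is deterministic, and adding a node with a new token only moves the keys in the range it takes over.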
Partitioning (continued): RandomPartitioner
- uses consistent hashing: each row is assigned a position on the cluster ring by the MD5 hash value of its row key
- even data distribution

Dynamic snitch
- a snitch layer that routes requests away from poorly-performing nodes
- load balancing by replication; most performance gains come from I/O reduction

Tunable consistency
- trades data availability and response time against data consistency
- for any given read or write operation, the client application decides how consistent the requested data should be
- consistency is ensured when: (nodes_written + nodes_read) > replication_factor

Write consistency levels
- specify on how many replicas the write must succeed before returning an acknowledgement to the client application
- (read consistency levels specify how many replicas must respond before a result is returned to the client application)
- writes are always sent to all replicas for the specified row, regardless of the consistency level
- ANY
- ALL
- quorum is calculated as: (replication_factor / 2) + 1, rounded down to a whole number
- ANY can be satisfied by a hinted handoff alone

Data-driven consistency
- use thresholds to manage consistency; cash withdrawal example:

  amount              consistency level
  < 1 000 $           ONE
  1 000 $ - 10 000 $  QUORUM
  > 10 000 $          EACH_QUORUM

Data consistency
- attempting full ACID compliance in distributed systems is a bad idea (and actually impossible in the strictest sense)
- relaxing data consistency != data corruption
- single-row updates are atomic; everything else is not

Scalability
- linear horizontal scalability on commodity hardware
- Why isn't the system fast enough?
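The quorum formula and the consistency condition above can be written out as a short Python sketch (the function names are mine, not part of any Cassandra API):

```python
def quorum(replication_factor: int) -> int:
    # (replication_factor / 2) + 1, rounded down to a whole number
    return replication_factor // 2 + 1

def is_strongly_consistent(nodes_written: int, nodes_read: int,
                           replication_factor: int) -> bool:
    # Reads see the latest write when the written and read replica sets
    # must overlap: nodes_written + nodes_read > replication_factor.
    return nodes_written + nodes_read > replication_factor
```

With replication_factor = 3, quorum is 2, so QUORUM writes plus QUORUM reads (2 + 2 > 3) are strongly consistent, while ONE writes plus ONE reads (1 + 1 > 3 is false) are not.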
Was it fast enough with only one user? Then the challenge is scalability, not performance.
- denormalize data

Adding nodes to the cluster
- calculate the tokens for the new nodes based on the expansion strategy
- the minimum possible set of data is affected
- SSTables are sorted by row key, and data is partitioned by row key
- the RandomPartitioner uses consistent hashing and does not require reconfiguration
- organizing data as a logical ring greatly simplifies adding/removing nodes
- ring ownership is continuously gossiped between nodes

Monitoring
- log4j
- nodetool
- JMX
- Datastax OpsCenter

Cassandra combines the Bigtable (Google) data model with the Dynamo (Amazon) architecture:
- Bigtable: A Distributed Storage System for Structured Data, http://research.google.com/archive/bigtable-osdi06.pdf
- Dynamo: Amazon's Highly Available Key-value Store, http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf

Key thoughts
- storage is cheap
- ACID semantics are not needed for all use cases
- memory access << network access < local disk access

Questions
Thank You

NoSQL
Benefits
- higher performance
- less administrative overhead

Drawbacks
- limited transactions
- limited ad-hoc query capabilities
- different data models
- different consistency/transaction models
- different API

Staged Event-Driven Architecture
- a software architecture that decomposes a complex application into a series of stages connected with queues
- an operation starts at one stage, which then passes it asynchronously to another stage, and so on until it completes
- each stage is associated with a thread pool, and that thread pool executes the operation when it is convenient to it (determined by resource availability)
- higher level of concurrency

Architecture
- all nodes are the same
- Gossip protocol
- Bloom filters
- Cache

Cache
- the key cache holds the location of keys in memory, on a per-column-family basis
- the row cache holds the entire contents of the row in memory

Fallacies of Distributed Computing
- the network is reliable
- latency is zero
- bandwidth is infinite
- the network is secure
- topology doesn't change
- there is one administrator
- transport cost is zero
- the network is homogeneous

Read consistency levels
- quorum is calculated as: (replication_factor / 2) + 1, rounded down to a whole number
- read requests are served by the replica node closest to the coordinator node that received the request, unless the dynamic snitch determines that that node is performing poorly and routes the request elsewhere (load balancing)

The staged, queue-based architecture also gives better resource management.

Use the right tool for the job
- if all you have is a hammer, everything looks like a nail
- there is no golden hammer

Kamil Szymański
firstname.lastname@example.org
27 November 2012, Warszawa JUG #103

Bloom filters
- a space-efficient probabilistic data structure (speed vs. precision)
- insert: generate several hashes per key and mark the bucket for each hash
- lookup: check each bucket; if any bucket is empty, the key was never inserted
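The Bloom filter mechanics just described, as a minimal Python sketch (the bit-array size, hash count, and MD5-based hashing scheme are arbitrary choices for illustration, not Cassandra's actual implementation):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = [False] * size

    def _buckets(self, key: str):
        # Derive hash_count bucket indexes per key by salting MD5.
        for i in range(self.hash_count):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        # Insert: mark the bucket for each hash of the key.
        for b in self._buckets(key):
            self.bits[b] = True

    def might_contain(self, key: str) -> bool:
        # Lookup: if any bucket is empty, the key was never inserted
        # (no false negatives); if all are set, it may be a false positive.
        return all(self.bits[b] for b in self._buckets(key))
```

A hit may be a false positive, so the SSTable is still read to confirm; a miss is definitive, which is what lets a read skip disk seeks entirely.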