C* rocks on…
( get rid of JOINs! )
- high/fast write throughput
- schema CQL design
- time-to-live on data
- size of data
- total load scale out
- uptime
- tunable consistency vs availability
a few uses…
Cassandra (and Hadoop) case study
#1 Users Search History
#2 Fraud detection
#3 IP-to-Geography
#4 Message Inbox
#5 Microservices metrics
#6 Event Statistics
C* is moving way quick!
lessons learnt
MACHINE SPECS
- 24 CPU
- 50 Gb RAM
- 5.5 Tb disks RAID50
- 100Gb SSD (commit logs)
- 100Gb SSD (HDFS)
- NOOP IO kernel scheduler
casssandra-2.0.11
- Avoid OrderPreservingPartitioner
- Avoid custom serialised data (transparency is gold)
- Avoid skinny rows on non-commodity machines
- CQL3 rocks (always!)
- Use json over maps (until you need maps)
- Don't run JobTracker+NameNode on a C node
- Upgrading to vnodes tedious (~2 months)
- During C ops
- plan carefully, test strategies, test clients,
- easy to bump CL.ONE -> CL.QUORUM,
- stop Hadoop
- Heed disk latency + utilisation
- >20% asking for trouble
- SSD for commit-log, separate SSD for HDFS,
- Comprehensive monitoring - detect problems quick
- C is robust and will hide problems
- monitor gc, and pending actions growing,
- look out for spikes in 95th percentile
- Stay under capacity!
- easy backup, repair, streaming, etc
(Cassandra writes sequential)
Y
Time-Series Event Tracking and Aggregation
Time-Series
Event Tracking and Aggregation
a la
"Event Statistics"
Time-Series Event Tracking and Aggregation
Kafka
Scribe
the real world…
- active development
- clustered (sync msgs)
- stream processing!
- a lot more servers
- zookeeper
- async msgs
- decentralised
- simple ops
- archaic options
- lost buffers
Time-Series Event Tracking and Aggregation
Still using super columns?
Time-Series Event Tracking and Aggregation
Time-Series Event Tracking and Aggregation
Time-Series Event Tracking and Aggregation
Splits
RecordReader
Time-Series Event Tracking and Aggregation
Each day:
- 5000+ minute jobs
- 7 daily jobs
- + ad-hoc jobs
- ~ 1 billion records read from C
- ~ 150M records written to C
Data Locality !
Time-Series Event Tracking and Aggregation
Simple Approach:
favour localhost location
TokenAwarePolicy(DcAwareRoundRobinPolicy(local-dc))
and use consistency level LOCAL_ONE
Proper approach:
hadoop-2.2 & Capacity Scheduler
capacity-scheduler.xml "yarn.scheduler.capacity.node-locality-delay = 3"
core-site.xml "net.topology.script.file.name = <your-topology-script>"
User Centric Statistics
Time-Series Event Tracking and Aggregation
Advanced…
- graphing: user->ad, ad->user, etc
- Mahout ("Taste") --> Myrrix
- Spark ALS
http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive
http://www.acunu.com/2/post/2011/08/scaling-up-cassandra-and-mahout-with-hadoop.html