Introducing 

Prezi AI.

Your new presentation assistant.

Refine, enhance, and tailor your content, source relevant images, and edit visuals quicker than ever before.

Loading…
Transcript

C* rocks on…

( get rid of JOINs! )

  • high/fast write throughput
  • schema CQL design
  • time-to-live on data
  • size of data
  • total load scale out
  • uptime
  • tunable consistency vs availability

}

a few uses…

Cassandra (and Hadoop) case study

#1 Users Search History

#2 Fraud detection

#3 IP-to-Geography

#4 Message Inbox

#5 Microservices metrics

#6 Event Statistics

C* is moving way quick!

lessons learnt

MACHINE SPECS

  • 24 CPU
  • 50 Gb RAM
  • 5.5 Tb disks RAID50
  • 100Gb SSD (commit logs)
  • 100Gb SSD (HDFS)
  • NOOP IO kernel scheduler

casssandra-2.0.11

  • Avoid OrderPreservingPartitioner
  • Avoid custom serialised data (transparency is gold)
  • Avoid skinny rows on non-commodity machines

  • CQL3 rocks (always!)
  • Use json over maps (until you need maps)

  • Don't run JobTracker+NameNode on a C node
  • Upgrading to vnodes tedious (~2 months)
  • During C ops
  • plan carefully, test strategies, test clients,
  • easy to bump CL.ONE -> CL.QUORUM,
  • stop Hadoop

  • Heed disk latency + utilisation
  • >20% asking for trouble
  • SSD for commit-log, separate SSD for HDFS,

  • Comprehensive monitoring - detect problems quick
  • C is robust and will hide problems
  • monitor gc, and pending actions growing,
  • look out for spikes in 95th percentile

  • Stay under capacity!
  • easy backup, repair, streaming, etc

(Cassandra writes sequential)

Y

Time-Series Event Tracking and Aggregation

Time-Series

Event Tracking and Aggregation

a la

"Event Statistics"

Time-Series Event Tracking and Aggregation

Kafka

Scribe

the real world…

  • active development
  • clustered (sync msgs)
  • stream processing!

  • a lot more servers
  • zookeeper
  • async msgs
  • decentralised
  • simple ops

  • archaic options
  • lost buffers

Time-Series Event Tracking and Aggregation

Still using super columns?

Time-Series Event Tracking and Aggregation

SizeTiered

Leveled

DateTiered

SizeTiered

Leveled

DateTiered

Time-Series Event Tracking and Aggregation

milliseconds

DateTiered

Leveled

SizeTiered

Time-Series Event Tracking and Aggregation

DateTiered

Leveled

SizeTiered

Splits

RecordReader

Time-Series Event Tracking and Aggregation

Each day:

  • 5000+ minute jobs
  • 7 daily jobs
  • + ad-hoc jobs
  • ~ 1 billion records read from C
  • ~ 150M records written to C

CL.ALL

CL.ONE

Data Locality !

Time-Series Event Tracking and Aggregation

Simple Approach:

favour localhost location

TokenAwarePolicy(DcAwareRoundRobinPolicy(local-dc))

and use consistency level LOCAL_ONE

Proper approach:

hadoop-2.2 & Capacity Scheduler

capacity-scheduler.xml "yarn.scheduler.capacity.node-locality-delay = 3"

core-site.xml "net.topology.script.file.name = <your-topology-script>"

User Centric Statistics

Simple…

Time-Series Event Tracking and Aggregation

Advanced…

  • graphing: user->ad, ad->user, etc
  • Mahout ("Taste") --> Myrrix
  • Spark ALS

http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive

http://www.acunu.com/2/post/2011/08/scaling-up-cassandra-and-mahout-with-hadoop.html

datacenter

nydalen

datacenter

postgirobygget

DC 1

DC 2

DC 1 _FAST

DC 2 _FAST

(+MapReduce +Spark)

  • Secondary Indexes
  • Compression
  • CQL3
  • CQL tracing
  • Automatic pagination
  • async and unlogged batch statements
  • fluent api to cql java driver
  • Counters-2 !!

(Cassandra-2.1)

DC 1

DC 2

DC 1 _FAST

DC 2 _FAST

Learn more about creating dynamic, engaging presentations with Prezi