ApacheCon – Cassandra (and Hadoop) – FINN.NO

http://sched.co/1pbkS3u
by mck semb wever on 24 June 2015

Transcript of ApacheCon – Cassandra (and Hadoop) – FINN.NO

a few uses…
#1 Users Search History
#2 Fraud detection
#3 IP-to-Geography
#4 Message Inbox
#5 Microservices metrics
#6 Event Statistics
Time-Series Event Tracking and Aggregation
C* rocks on…
high/fast write throughput
schema CQL design
time-to-live on data
size of data
total load scale out
uptime
tunable consistency vs availability
(get rid of JOINs!)
RecordReader
Splits
Still using super columns?
the real world…
Data Locality !
Simple Approach:
favour localhost location
TokenAwarePolicy(DCAwareRoundRobinPolicy(local-dc))
and use consistency level LOCAL_ONE
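
A rough sketch of what the simple approach amounts to, written in plain Python rather than against the real driver. It only illustrates the ordering idea: try the replica on localhost first, then other replicas in the local datacenter, then remaining local-datacenter hosts. The names Host and order_hosts and all addresses are made up for illustration, and leaving remote-datacenter hosts out entirely is an assumption of this sketch, not something the slides state.

```python
from typing import NamedTuple


class Host(NamedTuple):
    address: str
    dc: str


def order_hosts(replicas, all_hosts, local_address, local_dc):
    """Order candidates: localhost replica, local-DC replicas, other local-DC hosts."""
    local_replicas = [h for h in replicas if h.dc == local_dc]
    # False sorts before True, so the localhost replica moves to the front;
    # sorted() is stable, so the other replicas keep their relative order.
    local_replicas = sorted(local_replicas, key=lambda h: h.address != local_address)
    # Non-replica hosts in the local DC act as fallbacks (assumption of this sketch).
    fallbacks = [h for h in all_hosts if h.dc == local_dc and h not in local_replicas]
    return local_replicas + fallbacks


if __name__ == "__main__":
    a = Host("10.0.1.11", "dc1")
    b = Host("10.0.1.12", "dc1")
    c = Host("10.0.2.11", "dc2")
    d = Host("10.0.1.13", "dc1")
    for host in order_hosts([b, a, c], [a, b, c, d], "10.0.1.11", "dc1"):
        print(host.address)
```

With LOCAL_ONE, a read is satisfied by the first local-datacenter replica that answers, so a task that lands on a replica node can usually be served without leaving the machine.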

Proper approach:
hadoop-2.2 & Capacity Scheduler
capacity-scheduler.xml "yarn.scheduler.capacity.node-locality-delay = 3"
core-site.xml "net.topology.script.file.name = <your-topology-script>"
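
For the topology script itself, a minimal sketch could look like the following. Hadoop calls the configured script with one or more IPs or hostnames as arguments and expects one rack path per argument on stdout. The address table and rack names here are hypothetical (only the datacenter names come from this deck), so treat it as a starting point, not the script actually used.

```python
#!/usr/bin/env python3
"""Hypothetical rack-topology script for net.topology.script.file.name."""
import sys

# Made-up host-to-rack table; replace with your own inventory.
RACKS = {
    "10.0.1.11": "/nydalen/rack1",
    "10.0.1.12": "/nydalen/rack1",
    "10.0.2.11": "/postgirobygget/rack1",
}
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # One rack path per argument, separated by whitespace.
    print(" ".join(resolve(sys.argv[1:])))
```

The script has to be executable by the user running the Hadoop daemons, and unknown hosts should still get some rack so that scheduling does not fail on them.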
Each day:
5000+ minute jobs
7 daily jobs
+ ad-hoc jobs
~ 1 billion records read from C*
~ 150M records written to C*
CL.ONE
CL.ALL
User Centric Statistics
Simple…
Advanced…
graphing: user->ad, ad->user, etc
Mahout ("Taste") --> Myrrix
Spark ALS
http://www.acunu.com/2/post/2011/08/scaling-up-cassandra-and-mahout-with-hadoop.html
Kafka
Scribe
active development
clustered (sync msgs)
stream processing!

a lot more servers
zookeeper
async msgs
decentralised
simple ops

archaic options
lost buffers
lessons learnt
Avoid OrderPreservingPartitioner
Avoid custom serialised data (transparency is gold)
Avoid skinny rows on non-commodity machines

CQL3 rocks (always!)
Use json over maps (until you need maps)

Don't run JobTracker+NameNode on a C* node
Upgrading to vnodes tedious (~2 months)

C* is moving way quick!
Secondary Indexes
Compression
CQL3
CQL tracing
Automatic pagination
async and unlogged batch statements
fluent api to cql java driver
Counters-2 !!

During C* ops
plan carefully, test strategies, test clients,
easy to bump CL.ONE -> CL.QUORUM,
stop Hadoop

Heed disk latency + utilisation
>20% asking for trouble
SSD for commit-log, separate SSD for HDFS,

Comprehensive monitoring - detect problems quick
C* is robust and will hide problems
monitor gc, and pending actions growing,
look out for spikes in 95th percentile
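
One way to watch for 95th-percentile spikes, as a small sketch: compute a nearest-rank p95 per time window and flag any window whose p95 jumps well above the previous one. The function names and the factor of 2.0 are illustrative assumptions, not values from the talk.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def p95_spikes(windows, factor=2.0):
    """Return indexes of windows whose p95 exceeds factor times the previous window's p95."""
    p95s = [percentile(window, 95) for window in windows]
    return [i for i in range(1, len(p95s)) if p95s[i] > factor * p95s[i - 1]]


if __name__ == "__main__":
    # Three windows of latency samples in milliseconds (made-up numbers).
    calm = [1] * 19 + [2]
    spiky = [1] * 18 + [10, 12]
    print(p95_spikes([calm, calm, spiky]))
```

Averages would hide the third window here, since most samples are still fast; the tail is where the trouble shows first.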

Stay under capacity!
easy backup, repair, streaming, etc

(Cassandra writes sequentially)
Time-Series
Event Tracking and Aggregation

a la
"Event Statistics"
Cassandra (and Hadoop) case study

MACHINE SPECS
24 CPU
50 GB RAM
5.5 TB disks RAID50
100 GB SSD (commit logs)
100 GB SSD (HDFS)
NOOP IO kernel scheduler
datacenter
nydalen
datacenter
postgirobygget
cassandra-2.0.11
DC 1 _FAST
DC 2 _FAST
DC 1
DC 2
(+MapReduce +Spark)
(Cassandra-2.1)
http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive
[Figure: SizeTiered, Leveled and DateTiered compaction compared across DC 1 _FAST, DC 2 _FAST, DC 1 and DC 2; axis in milliseconds]