Cassandra Summit – Cassandra (and Hadoop) at FINN.no

by mck semb wever on 17 April 2015

Transcript of Cassandra Summit – Cassandra (and Hadoop) at FINN.no

[Diagram: application writes, reads, and deletes; labels "writes", "reads", "~60 us", "~9 ms"]
ad_status
ad_created
4 replicas (1 per datacenter)

Consistency Level
ONE for reads+writes
4 replicas (1 per datacenter)

CL.QUORUM for all operations.
(w/ DowngradingConsistencyRetryPolicy)
4 replicas (1 per datacenter)
heavy use of secondary indexes
CL.ONE for reads+writes
extra cf with counters
a few uses…
Users Search History
Fraud detection
Message Inbox
(email address obfuscation)
Time-Series Event Tracking and Aggregation
we use C* for…
high/fast write throughput
CQL schema design
time-to-live on data
size of data
total load scale out
uptime
tunable consistency vs availability
get rid of JOINs!!
RecordReader
Splits
Still using super columns?
the real world…
Time-Series Event Tracking and Aggregation
Data Locality !
Simple Approach:
favour localhost location
TokenAwarePolicy(DcAwareRoundRobinPolicy(local-dc))
with consistency level LOCAL_ONE

Proper approach:
hadoop-2.2 & Capacity Scheduler
capacity-scheduler.xml "yarn.scheduler.capacity.node-locality-delay = 3"
core-site.xml "net.topology.script.file.name = <your-topology-script>"
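
As a sketch, the two settings quoted above would sit in the Hadoop configuration files roughly as follows. The property names and the value 3 come from the slide. The topology script value is left as a placeholder, and the surrounding <configuration> layout is the usual Hadoop site-file shape, assumed here rather than taken from the talk.

```xml
<!-- Fragment for capacity-scheduler.xml -->
<configuration>
  <property>
    <!-- Missed scheduling opportunities tolerated before giving up on a node-local container -->
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>3</value>
  </property>
</configuration>

<!-- Fragment for core-site.xml -->
<configuration>
  <property>
    <!-- Script that maps a host to its rack/datacenter; placeholder value, as on the slide -->
    <name>net.topology.script.file.name</name>
    <value>YOUR-TOPOLOGY-SCRIPT</value>
  </property>
</configuration>
```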
Each day:
5000+ minute jobs
7 daily jobs
+ ad-hoc jobs
~ 1 billion records read from C*
~ 150M records written to C*
User Centric Statistics
Time-Series Event Tracking and Aggregation
Simple…
Advanced…
graphing: user->ad, ad->user, etc
Mahout ("Taste") --> Myrrix
Spark ALS
http://www.acunu.com/2/post/2011/08/scaling-up-cassandra-and-mahout-with-hadoop.html
Kafka
Scribe
Time-Series Event Tracking and Aggregation
active development
clustered (sync msgs)
direct stream processing

a lot more servers
zookeeper
async msgs
decentralised
simple ops

archaic options
lost buffers
the blood, sweat, and tears…



CQL3 rocks (always!)
Use json over maps (until you need maps)
Stick to the datastax java driver

Upgrading to vnodes tedious (~2 months)
but worth it

C* keeps throwing punches!
DateTieredCompactionStrategy
secondary indexes
compression
CQL3
CQL tracing
automatic pagination
fluent api to cql java driver
counters-2 !!

Heed disk latency and utilisation
SSD just for commit-log

Comprehensive monitoring
detect problems quickly
C* is robust and will hide problems

Stay under capacity
repair, streaming, upgrade, adding nodes, etc

Time-Series
Event Tracking and Aggregation

a la
"Event Statistics"

MACHINE SPECS
24 CPU
50 GB RAM
5.5 TB disks RAID50 / JBOD
100 GB SSD (commit logs)
100 GB SSD (HDFS)
NOOP IO kernel scheduler
datacenter
nydalen
datacenter
postgirobygget
cassandra-2.0.11
hadoop-2.2.0
DC 1 _FAST
DC 2 _FAST
DC 1
DC 2
Example data
IP-to-Geography
MicroServices statistics
Users Search History
Fraud detection
(+MapReduce +Spark)
(Cassandra-2.1)
Time-Series Event Tracking and Aggregation
http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive
DC 1 _FAST
DC 2 _FAST
DC 1
DC 2
~25
~150
IP-to-Geography
4 replicas (1 per datacenter)

Consistency Level
ONE for reads+writes
(email address obfuscation)
(Inbox data)
durable_writes = 'false'

Replay logs
Bounced mail Statistics
OpsCenter
systems metrics monitoring
Zipkin
Recommendations
Events
Event Statistics
Microservices connectivity
Email address aliases
Message Inbox
IP-to-Geography
Fraud detection
SizeTiered
DateTiered
Leveled
Averages
Time-Series Event Tracking and Aggregation
Top 100 past 24hrs
streaming…?
Time-Series Event Tracking and Aggregation
the proof's in the pudding
Performance
<= 20x read increase
10-100x write increase

Uptime
⅓ serious production incidents
Spark RDDs
cross_node_timeout and rapid read protection
Hadoop
Don't run masters on a C* node
Separate SSD for HDFS, store everything in C*

During C* ops
plan carefully, test strategies, test clients,
easy to bump CL.ONE -> CL.QUORUM