Loading presentation...

Present Remotely

Send the link below via email or IM


Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.


Keeping it real and realtime with Cassandra & Spark


Catalin Alexandru Zamfir

on 20 April 2016

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of Keeping it real and realtime with Cassandra & Spark

Keeping it real and real-time with C* and Spark
- lessons learned, the hard way, from a production setup;
- tips on compaction strategies;
- tricks of data modelling;
- operational gotchas;
Lessons, the hard-way, #1
bash # cat /etc/gameloft/presenters
catalinalexandru.zamfir@gameloft.com, Big Data Enginner
marius.ion@gameloft.com, Data Scientist
Facts on C*:
no SPOF, but that doesn't mean no-MPOF;
a BASE system, favors AP (CAP theorem);
DC-roundrobin and token-aware client drivers;
configurable write/read-path performance through CL;
simple for newbie developers (CQL3);
developer-controlled distribution of data;
allows for real-time and high-throughput scenarios;
scales horizontally and married to concurrency;
bash # tail -n1 /etc/assumptions
Basic or minimal knowledge of Cassandra

Our use-case:
data acquisiton, game events organized in packages;
avg. 6500 pckg/s approx. 561.6M packages/day;
avg. 4 events/package, meaning +2B events/day;
7 days TTL over DataTieredCompactionStrategy;
99.99% uptime, with underlying OS update & HW change, 2 to 8TB HDDs;
15 high-core & high-memory nodes;
don't mix JVM flavors
in the same cluster, you won't observe issues unless you need to grow;
there wasn't a check for the Oracle JRE in C*;
sporadic signs of
you may think are from hardware and an accumulation of such issues;
JOINING never finishes, stops with a Native* exception;
tried fixing the JRE, replacing Open JRE with Oracle JRE;

... LOST ALL DATA in production
Lessons, the hard-way, #3
compactions for ALL CFs must remain under a few hundreds pending;
prefer to pay-in-CPU-time than to be in the situation to erase-and-start-over;
to many SSTables per CFs, grow directory size on file-system, may need to TRUNCATE;
temporary spikes in "Compactions Pending" to be expected, as long as they go down in time;
Trick: "normalize" if you can
index CFs are usually 5 to 10% of data, depending on your index structure;
used to have
in CFs for
packages ==
PK (uuid)
for instant access by UUID;
packages_buffered ==
PK (second, source, time-uuid)
for batch access;
twice the data, effect of de-normalization process;
CREATE pkg_index == PK (uuid, (second, time-uuid));
generate time-uuid on
as now () is NOT a good time-UUID source, computed on coordinator (CASSANDRA-10900);
code update to do an extra step, with
for the given package UUID, go to
and return data;
Reality check
queries to PART keys
hash O (1) time + HDD seek;
queries to PART + CLUSTER keys
hash O (1) time + binary-search (O (logn)) for key + HDD seek;
drop 50% of denormalized data (packages), save a TON of resources;
freeing resource is
A Good Thing
. Normalization does not hurt if the scope is to free-up resources than the index consumes (50% de-normalized data vs. 10% index as above);
high-core & high-mem nodes have the advantage of locality;
disable internode_compression;
GossipingPropertyFileSnitch is your production-ready friend;
cross_node_timeout == yes, it's too obvious;
big timeouts (read/write) but don't set them too high (2x, 3x at most);

Even if you have RAID or JBOD and your disk R/W speeds are in the GB/s your concurrent_compactors && compaction_throughput are "tied";
growing these may bring your nodes "down" due to their latency caused by high I/O which makes Gossip to mark the node as not available;
do the math, 4 (cc) x 196 MB/s (ct) == 784 MB/s;
monitoring TIP with OpsCenter "Total Bytes Compacted" graph for (hour, day, week) should be < concurrent_compactors * compaction_throughput;
given 1 x HDD per node, magic RULE:
rounded benchmarked HDD speed / concurrent_compactors == compaction_throghput_*;
Lessons, the hard-way, #2
when using
, write-performance is
not the advertised <2ms
10ms or more
authentication is
a read-before-write with some caching;
==# of nodes,
RAM speed;
not (==) # of nodes, coordinator contacts remote node;
write-time speed == (RTT of connection) + (node latency);
our use-case
0.9 to 2.1 ms/op
, for 90% of operations over a week;
Lessons, the hard-way, #4

NOT OK for high-traffic week-ends (eg. "week-end" industry);
compacts week-end #1 with week-end #2, puts VERY high pressure on cluster;
good for constant flow, write-once/never-TTL type of data (sensors);
, good for varying traffic flow with TTL requirement;
biggest CF gets
smallest prime number for base_time_seconds
and so on (see CASSANDRA-8417);
needs to be set to prime numbers for CFs;
== same on ALL CFs, compactions trigger ALL at once;
distributed over prime numbers, very rare intersections;
Lessons, the hard-way, #5
, the community needs more;
if DTCS does not fit your use-case, C* has a pluggable compaction strategy. If you have a different traffic flow or need to go deeper, dare to write your own;
tip: saves you a TON of resources, cause you can be picky with your compactions ...
Playing with matches gets you "fired" :D ... pun intended
big data's about this big
Full transcript