From Simple CQL to Time-Series Event Tracking and Aggregation Using Cassandra and Hadoop

http://sched.co/189ZYdg
by

mck semb wever

on 4 September 2014

Transcript of From Simple CQL to Time-Series Event Tracking and Aggregation Using Cassandra and Hadoop

Application writes:
INSERT INTO search (loginid, searchid, searchurl, searchkey, description)
VALUES (userKey, now(), searchUrl, searchKey, description) USING TTL (3 months);

Application reads:
SELECT * FROM search WHERE loginid = userKey LIMIT limit;

Application deletes:
DELETE FROM search WHERE loginid = userKey;
Application writes: ~15 μs
Application reads: ~10 ms
100/s, 1:60 read:write ratio
4 replicas (1 per datacenter)
CL.ONE for reads and writes
Two datacenters: Oslo (suburbs) and Oslo (downtown)
Logical DCs: DC_FAST 1, DC_FAST 2, DC 1, DC 2

MACHINE SPECS
  • 24 CPUs
  • 50 GB RAM
  • 5.5 TB disk, RAID50
  • 100 GB SSD (commit logs + HDFS + ext4 journal)
  • NOOP I/O scheduler
Application writes:
INSERT INTO ads (day, adid, millis)
VALUES (today, ad.getId(), millis) USING TTL (1 day);

Application reads:
SELECT adid, millis FROM ads WHERE day IN (today, yesterday) LIMIT 30000;
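The deck can say "only ever two rows active" because the partition key is a day bucket combined with the 1-day TTL: writes always land in today's row, reads cover today and yesterday, and anything older has already expired. A small sketch of the bucketing (the function name and ISO-date bucket format are assumptions):

```python
from datetime import date, timedelta

def day_buckets(today):
    """Partition keys to query: today's and yesterday's day bucket.

    With a 1-day TTL only these two partitions ever hold live data,
    which keeps the table at "only ever two rows active".
    """
    yesterday = today - timedelta(days=1)
    return [d.isoformat() for d in (today, yesterday)]

print(day_buckets(date(2014, 9, 4)))  # ['2014-09-04', '2014-09-03']
```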
Application writes: ~13 μs
Only ever two rows active
4 replicas (1 per datacenter)
CL.ONE for reads and writes
Application writes: ~25 μs
4 replicas (1 per datacenter)
CL.ONE for reads and writes
Application writes: ~100 μs
Application reads: ~650 μs
4 replicas (1 per datacenter)
CL.QUORUM for all operations (w/ DowngradingConsistencyRetryPolicy)

Application writes: ~21 μs
Application reads: ~1 ms
4 replicas (1 per datacenter)
Heavy use of secondary indexes
Extra cf with counters
CL.ONE for reads and writes
cassandra-1.2.9
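For context on the CL.QUORUM setup above: Cassandra defines quorum as floor(RF/2) + 1, so with 4 replicas spread one per datacenter, every quorum operation needs acknowledgement from 3 replicas, i.e. from 3 different datacenters. A quick sketch of the formula:

```python
def quorum(replication_factor):
    """Replicas that must respond at CL.QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

# With 4 replicas (1 per datacenter) a quorum needs 3 of them, so
# every CL.QUORUM read or write spans at least 3 datacenters.
print(quorum(4))  # 3
```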
INSERT INTO ad_status (adid, updated, status)
VALUES (adId, today, status) USING TTL (1 year);

INSERT INTO ad_created (day, hour, created, adid, rules)
VALUES (day, hour, now, adId, {key1: value1, key2: value2})
USING TTL (1 year);

SELECT adid, rules FROM ad_created WHERE day = day AND hour = hour;

SELECT status FROM ad_status WHERE adid = adId ORDER BY updated ASC;
#1 Users Last Searches
#2 Application record
#3 Fraud detection
#4 Message Inbox
"The Apache Hadoop Ecosystem", Doug Cutting
http://oreil.ly/14SpaGl
Simple Use Cases
#1 Users Search History
#2 Application record
#3 Fraud detection
#4 Message Inbox (email address obfuscation; Inbox data)
Time-Series Event Tracking and Aggregation
We use Cassandra when…
  • high/fast write throughput
  • schema-free design
  • time-to-live on data
  • size of data
  • total load scale out
  • uptime
  • tunable consistency vs availability
  • (get rid of JOINs!)
RecordReader
Splits
Still using super columns?
the real world…
Time-Series Event Tracking and Aggregation
Data Locality!
Simple approach: favour localhost location
  TokenAwarePolicy(DCAwareRoundRobinPolicy(local-dc))
Proper approach: hadoop-0.22.0 & FairScheduler
  "mapred.fairscheduler.locality.delay"
https://github.com/michaelsembwever/cassandra-hadoop
PIG
http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive
Each day:
  • 5600 minute jobs
  • 6 day jobs
  • + ad-hoc jobs
  • >300M records in, >50M records out
CL.ONE
CL.ALL
User Centric Statistics
Simple… graphing: user->ad, ad->user, etc
Advanced… Mahout ("Taste") --> Myrrix
Requires:
  • Hadoop-0.22.0+
  • customise inputs: RecommenderJob, ItemIdIndexMapper, ToEntityPrefsMapper
  • de-normalise all ad content to every c*-hadoop node
http://www.acunu.com/2/post/2011/08/scaling-up-cassandra-and-mahout-with-hadoop.html
Kafka vs Scribe
  • Kafka: active development, clustered (sync msgs), stream processing!
  • Scribe: async msgs, simple ops, archaic options, lost buffers
Lessons learnt
  • Avoid OrderPreservingPartitioner
  • Avoid custom serialised data (transparency is gold)
  • Avoid skinny rows on non-commodity machines
  • CQL3 rocks (almost always)
  • Don't run JobTracker+NameNode on a C* node
  • During C* ops: easy to bump CL.ONE -> CL.QUORUM; stop Hadoop
  • Heed disk latency+utilisation: >20% is asking for trouble
  • noop scheduler in kernel; noatime, move journal to SSD; minimise I/O from other processes

C* has come a long way:
  • Composite Columns
  • Secondary Indexes
  • Compression
  • Counters
  • CQL3
  • CQL MapReduce
(Counter-2.0 please! CASSANDRA-4775)
(C* sequential)
Time-Series Event Tracking and Aggregation ("Event Statistics")
@mck_sw
http://tech.finn.no/

grep "Choosing data-local task" hadoop-finn-jobtracker.log
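The grep above counts the JobTracker's data-local task assignments. A small sketch of turning that into a locality ratio (the "Choosing rack-local task" counterpart line is assumed from Hadoop 0.22-era JobTracker logging):

```python
import re

def locality_ratio(log_lines):
    """Fraction of task assignments the JobTracker logged as data-local.

    Assumes JobTracker log lines containing either
    "Choosing data-local task" or "Choosing rack-local task".
    """
    local = sum(1 for line in log_lines
                if "Choosing data-local task" in line)
    total = sum(1 for line in log_lines
                if re.search(r"Choosing \S+-local task", line))
    return local / total if total else 0.0
```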
From Simple CQL to Time-Series Event Tracking and Aggregation

@mck_sw
#CassandraEU