SDEC 2012: Apache Kafka

Inside LinkedIn's distributed publish/subscribe messaging system

Richard Park

on 17 September 2012

Transcript of SDEC 2012: Apache Kafka

Apache Kafka
Inside LinkedIn's distributed publish/subscribe messaging system

Who Am I?

Outline
Why Kafka?
What is Kafka?
How?
Kafka @ LinkedIn

What we want
Decisions
etc

Kafka Nutshell
Written in Scala
Persists logs to disk
High throughput, low latency

Design
Performance

Data @ LinkedIn
Kafka is data agnostic. LinkedIn is not
We want compact binary serialization format
Standardization good.
Prefer one serialization format for messages and storage (files)
Apache Avro!

Evaluated existing solutions

Scale
thousands of producers and consumers
high throughput: billions to trillions of messages/day
grow system naturally

Robust
Handle different types of usage

Consume data online and offline

Latency
Fairly fast
Seconds

e.g. logs, messages, activity events

Reliable
As few messages lost as possible
Fault tolerant
Safe (i.e. play nicely with others)
Persistent

Before Kafka
ActiveMQ for Messaging
Splunk for Logging
JMX to Zenoss for metrics
Custom ETL, Databus for database data
In-house log aggregation for user activity

User Activity Data
e.g. page views, profile views, ad impressions ...
Doesn't scale with growth
now at 175M+
more global members

Instead, we made Kafka
persist data efficiently
consume data online/offline
pull vs push
and we're slightly crazy

Broker

Persistence
Write to disk. Are we crazy?
Random access bad, sequential good
Take advantage of file system caching

Partitioning
Data file blocks for cache and retention

Producer
Consumer

Events, Messages, Logs... oh my!
message payload can be anything
various APIs: REST, Java, C++ etc.
load balance: Zookeeper or hardware
batch messages to reduce overhead (e.g. network...)
compress batched events to lower network bandwidth and disk write resources, at the cost of a few CPU cycles
Use Zookeeper to keep track of brokers

Zookeeper

Multiple partitions for large events for parallelism
'Offsets' determine message location in data

Node failure detected by Zookeeper
Unconsumed data on failed node lost (disk failure)
Replication coming soon!

Pull Data
What do you need to pull from a broker?
Broker location (e.g. node ID, IP address)
Topic name (e.g. page views)
Partition # (e.g. partition 1)
Offset into data (e.g. 20239303?)

Let's be lazy
Kafka Consumer can do management for you.
Uses Zookeeper to store offsets for each node and partition
Just need to specify 'Topic'
Due to batching, may not get single message, but a set of messages
Data ordered within partition, but not across multiple partitions

Broker location, partitions from Zookeeper
Topic, Partition #, Offset from client

Cooked Test Parameters

2 Linux Boxes
16 - 2.0 GHz Cores
6 - 7200 rpm SATA, RAID 10
24GB RAM
1Gb Network

Producer
Parameters: 200 byte messages, 200 message batches, 40K batch size
Results: 90 MB/sec throughput

Broker
Parameters: 100 topics, 100K broker flush interval

Consumer
Parameters: 1MB consumer batch size
Results: 60 MB/sec throughput, 220 ms latency from producer

LinkedIn numbers (somewhat outdated)
10 billion message writes per day
55 billion message reads per day
Peak incoming: 30 MB/s
Peak outgoing: 170 MB/s
3x compression ratio (GZIP)
10 sec avg latency from live data centers to offline
6 min average time until produced data appears in Hadoop
>350 topics

LinkedIn's Setup
8 servers/data center
Each server: 6TB of storage, RAID 10 using SATA
Datacenter-local Kafka clusters replicate to a central data center

@ LinkedIn
What do we use Kafka for?

Why Avro?
Dynamic typing
No code generation
Intuitive schema evolution
We like JSON.

Avro Schema
Avro uses JSON schemas
Schema used to write is needed to read
Passing schema with every message = big and fat
Solution? SchemaRegistry

Schema Registry

Schema Validation
Check validity of new schemas
Only valid schemas can be registered
Only allow backward compatible Avro changes
Provide tools to check validity of schemas

Hadoop
Primary consumer of Kafka data
All messages available to Pig, Hive, etc

Pulling Data
Run regularly scheduled map-reduce pull job
Mappers pull from individual brokers in parallel
Handles offset by itself. Zookeeper only for discovery
Uses schema registry to deserialize Avro

Data Trust
Use Kafka to monitor itself!
Does # in = # out?
We want to catch failures and be confident in quality.
We need to be notified when messages are missing or delayed

Auditing
Producer counts # of events sent
Tiers (brokers) count # of events passing through
Hadoop counts # of events pulled
Each count published as a special audit message to Kafka
Tool consumes audit messages from Kafka like a regular data feed
find # of events lost, where dropped, and data delay

Future
Replication
Stream processing?

For more info:
http://incubator.apache.org/kafka

Current State
successfully deployed at LinkedIn and several other companies
Apache Incubator
Current version 0.7.1

The End
Thank you!

More Hadoop
Mapper partitions data into hourly partitions
Pull Avro, store Avro
Hourly data is collected into daily partition.
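
The hourly-to-daily flow on this slide can be illustrated with a small in-memory sketch. It is not the actual map-reduce job: the function names, the `page_views` topic, and the `YYYY/MM/DD/HH` key layout are assumptions made for illustration, and plain byte strings stand in for Avro records.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_key(topic, ts):
    """Partition key for one event: (topic, 'YYYY/MM/DD/HH') in UTC (assumed layout)."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return topic, t.strftime("%Y/%m/%d/%H")


def partition_hourly(events):
    """Group (topic, unix_ts, payload) events into hourly partitions."""
    hourly = defaultdict(list)
    for topic, ts, payload in events:
        hourly[hourly_key(topic, ts)].append(payload)
    return dict(hourly)


def roll_up_daily(hourly):
    """Collect hourly partitions into one partition per (topic, day)."""
    daily = defaultdict(list)
    for (topic, hour), payloads in sorted(hourly.items()):
        day = hour.rsplit("/", 1)[0]  # drop the trailing '/HH'
        daily[(topic, day)].extend(payloads)
    return dict(daily)


# Hypothetical events: (topic, unix timestamp, payload)
events = [
    ("page_views", 1347840000, b"a"),  # 2012-09-17 00:00:00 UTC
    ("page_views", 1347843600, b"b"),  # 2012-09-17 01:00:00 UTC
    ("page_views", 1347926400, b"c"),  # 2012-09-18 00:00:00 UTC
]
hourly = partition_hourly(events)
daily = roll_up_daily(hourly)
```

Here the three events land in three hourly partitions, and the roll-up leaves two daily partitions, one per day.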

Avg 6 mins latency end-to-end, available to users in Pig and Hive

Kafka Monitor Tool
Store these aggregated 'counter' events in a db
Kafka Monitor tool shows errors, topic event count
Alerts on missing and delayed data

Local schema cache checked
Validates schema
schema to remote registry. Key is MD5 hash of schema
Data is serialized. MD5 hash for schema added to message payload

fewer files
larger files
less total data size (compresses better)
less map-reduce task overhead
less strain on name node
fewer angry operations people

* (like sharding)

data often delayed
only used for offline

How is the data used in Hadoop?
Data Products
PYMK
Recommendations (Ads!)
Standardization
Analytics
Business operations
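
The schema registry steps listed earlier in this transcript (local schema cache checked, schema stored in the remote registry under the MD5 hash of the schema, hash added to the message payload) could look roughly like the sketch below. Everything here is an assumption for illustration: `SchemaRegistry` is an in-memory stand-in for the remote service, the 16-byte hash is placed as a prefix, JSON stands in for Avro binary encoding so the example needs only the standard library, and the schema validation step is left out.

```python
import hashlib
import json


class SchemaRegistry:
    """Hypothetical in-memory stand-in for the remote schema registry."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema_json):
        # Key is the MD5 hash of the schema text (16 raw bytes).
        key = hashlib.md5(schema_json.encode("utf-8")).digest()
        self._schemas[key] = schema_json
        return key

    def lookup(self, key):
        return self._schemas[key]


def encode(registry, cache, schema_json, record):
    """Serialize a record and prefix it with the MD5 hash of its schema."""
    key = cache.get(schema_json)
    if key is None:  # local cache miss: go to the registry
        key = registry.register(schema_json)
        cache[schema_json] = key
    body = json.dumps(record, sort_keys=True).encode("utf-8")  # stand-in for Avro
    return key + body


def decode(registry, message):
    """Split off the 16-byte hash, fetch the writer schema, parse the body."""
    key, body = message[:16], message[16:]
    return registry.lookup(key), json.loads(body.decode("utf-8"))


# Hypothetical schema and record
schema = '{"type": "record", "name": "PageView", "fields": [{"name": "member_id", "type": "long"}]}'
registry = SchemaRegistry()
cache = {}
msg = encode(registry, cache, schema, {"member_id": 42})
schema_back, record = decode(registry, msg)
```

Sending a 16-byte hash instead of the full schema is what avoids the "big and fat" messages: the reader recovers the writer's schema from the registry by hash.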