High Volume Data and Schema Evolution

Using Apache Kafka and Apache Avro to decouple high volume streaming data consumption and production from each other across schema changes.

Scott Carey

on 10 April 2013


Transcript of High Volume Data and Schema Evolution

Apache Kafka + Apache Avro: Schema Evolution on Streaming Data
Decoupling High Volume Data Production and Consumption
Scott Carey (scottcarey@apache.org)

Kafka: High-Throughput BigData Pub/Sub

  • Operates on message batches, compressed at the batch level
  • The broker does not track delivery
  • The broker retains data on its own terms: typically time- or space-bound, removing the oldest data first
  • Consumers pull from the broker, streaming or batch, and track their own progress

Scalable Design

  • Data is organized by topic
  • Topics can have multiple partitions
  • Reads are delivered in consistent order per partition
  • Partitions scale horizontally at each level

Avro: Data Serialization

  • Expressive schemas
  • Efficient, compact binary serialization
  • Fields are not tagged: more compact, potentially faster
  • The schema as written must be known at read time
  • Code generation is optional

Schema Resolution

Read data written with schema A using a compatible schema B:
  • Ignore data in A not specified in B
  • Apply default values to fields in B not present in A
  • Promote types
  • Reorder fields

[Diagram: a Kafka broker hosting Topics 1 through 4; each message pairs a payload with a Schema ID.]

Schema Repository

  • Maps Avro schemas to schema IDs
  • Validates schema compatibility
  • REST interface
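A minimal in-memory sketch of that schema-to-ID mapping, with hypothetical class and method names (the actual proposal is a REST service, and it would also validate each new schema for compatibility before registering it):

```python
# Toy in-memory schema repository. Class and method names are
# illustrative assumptions, not the real AVRO-1124 API.
class SchemaRepository:
    def __init__(self):
        self._by_id = {}      # schema ID -> schema text
        self._by_schema = {}  # schema text -> schema ID
        self._next_id = 1

    def register(self, schema: str) -> int:
        """Return the ID for a schema, assigning a new one if unseen.

        A real repository would first validate that the schema is
        compatible with the schemas already registered for the topic.
        """
        if schema not in self._by_schema:
            self._by_schema[schema] = self._next_id
            self._by_id[self._next_id] = schema
            self._next_id += 1
        return self._by_schema[schema]

    def lookup(self, schema_id: int) -> str:
        """Return the schema registered under an ID."""
        return self._by_id[schema_id]
```

A producer would call register() once at startup and tag every message with the returned ID; a consumer would call lookup() (caching the result) whenever it sees an unfamiliar ID.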
See AVRO-1124.

Producer

Produce data with schema:

record Coffee {
  string brand = "";
  float ounces;
  boolean caffeinated = true;
}

Acquire a schema ID from the Schema Repository. Request an ID for:

record Coffee {
  string brand = "";
  float ounces;
  boolean caffeinated = true;
}

Response: "1"

Send all messages tagged with ID "1".

Consumer

Consume data with the schema below. For each record from Kafka:

  • Read the schema ID tag
  • Look up the schema in the Repository (cacheable)
  • Generate an Avro schema reader for each reader-writer schema pair, as needed
  • Read the payload with the appropriate reader

record Coffee {
  float ounces;
  boolean caffeinated = true;
  string countryOfOrigin = "";
}

Result: Consumers Decoupled from Producers

  • Compatible schema changes decouple producers from consumers
  • As new producers come online or are upgraded, they can evolve the schema; consumers do not have to change at the same time
  • Likewise, consumers can change the schema they use to interpret the data with no producer changes
  • Invalid schema evolution is caught by Repository Server validation
  • Choose your own schema and validation rules
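The tag-and-look-up flow above can be sketched end to end. The 4-byte big-endian schema-ID prefix and the in-memory stand-in for the repository are assumptions for illustration; the talk does not fix a wire format:

```python
import struct
from functools import lru_cache

# Stand-in for the Schema Repository; in practice this lookup would
# be a REST call (see AVRO-1124).
SCHEMAS = {
    1: 'record Coffee { string brand = ""; float ounces; '
       'boolean caffeinated = true; }',
}

def frame(schema_id, payload):
    """Producer side: prefix every payload with its schema ID.

    The 4-byte big-endian prefix is an assumed encoding.
    """
    return struct.pack(">I", schema_id) + payload

def unframe(message):
    """Consumer side: split a message into (schema ID, payload)."""
    (schema_id,) = struct.unpack_from(">I", message)
    return schema_id, message[4:]

@lru_cache(maxsize=None)
def resolve_schema(schema_id):
    """Cacheable lookup: each ID hits the repository at most once."""
    return SCHEMAS[schema_id]

def consume(message):
    """Consumer loop body: read the tag, resolve the writer schema,
    then hand the (writer schema, payload) pair to an Avro reader."""
    schema_id, payload = unframe(message)
    writer_schema = resolve_schema(schema_id)
    return writer_schema, payload
```

The actual Avro decode step is omitted: given the resolved writer schema and the consumer's own reader schema, Avro's schema resolution produces the reader that interprets the payload.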