Apache Kafka + Apache Avro
Schema Evolution on Streaming Data:
Decoupling High Volume Data Production and Consumption
Scott Carey
scottcarey@apache.org

High Throughput BigData Pub/Sub
- Operate on message batches
- Compress at the batch level
- Broker does not track delivery
- Broker retains data on its own terms
  - Typically time or space bound
  - Removes oldest data first
- Consumers pull from the broker
  - Streaming or batch
  - Track their own progress (sketched below)
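A minimal sketch of that pull model with the Kafka Java consumer client: broker-side offset commits are disabled and the consumer checkpoints its own position. The broker address, the topic name "coffee", and the loadSavedOffset/saveOffset/process helpers are illustrative assumptions, not part of the talk:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class PullingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "coffee-readers");
            // The broker does not track delivery; disable broker-side offset commits.
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("coffee", 0);
                consumer.assign(List.of(partition));
                consumer.seek(partition, loadSavedOffset()); // resume from our own checkpoint

                while (true) {
                    // Pull: the consumer asks for data; nothing is pushed to it.
                    ConsumerRecords<byte[], byte[]> batch = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<byte[], byte[]> record : batch) {
                        process(record.value());
                        saveOffset(record.offset() + 1); // the consumer tracks its own progress
                    }
                }
            }
        }

        static long loadSavedOffset() { return 0L; } // hypothetical checkpoint store
        static void saveOffset(long offset) {}       // hypothetical checkpoint store
        static void process(byte[] payload) {}       // application logic
    }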
Kafka: Scalable Design
- Data organized by Topic
- Topics can have multiple partitions
- Reads are delivered in consistent order per partition (sketch below)
- Partitions let each tier (brokers, producers, consumers) scale horizontally
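Per-partition ordering is what makes keyed writes useful: records with the same key hash to the same partition, so a consumer reads each key's records in production order. A minimal sketch with the Kafka Java producer client; the topic "coffee" and the key "machine-42" are illustrative assumptions:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // All three records share the key "machine-42", so they land in the
                // same partition of "coffee" and are read back in this order.
                for (int i = 0; i < 3; i++) {
                    producer.send(new ProducerRecord<>("coffee", "machine-42", "brew #" + i));
                }
            }
        }
    }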
Producer

Produce data with a schema (Avro IDL):

    record Coffee {
        union { null, string } brand = null;
        float ounces;
        boolean caffeinated = true;
    }

or with a newer, compatible revision:

    record Coffee {
        string brand = "";
        float ounces;
        boolean caffeinated = true;
    }

Acquire a Schema ID from the Schema Repository:
- Request an ID for the schema
- Response: the Schema ID

Send all messages tagged with the ID:
- Each message carries the Schema ID followed by the message payload (sketched below)
- The Kafka Broker stores the tagged messages in their topic (Topic 1, Topic 2, Topic 3, Topic 4)
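Putting the flow together, a sketch of a tagging Producer using Avro's generic API and the Kafka producer client. The [4-byte ID][Avro binary] framing is one possible reading of "tagged with ID"; registerSchema stands in for the repository's REST call, and the broker address and topic name are illustrative:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Properties;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TaggedProducer {
        // The evolved Coffee schema from the slide, in Avro JSON schema syntax.
        static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Coffee\",\"fields\":["
            + "{\"name\":\"brand\",\"type\":\"string\",\"default\":\"\"},"
            + "{\"name\":\"ounces\",\"type\":\"float\"},"
            + "{\"name\":\"caffeinated\",\"type\":\"boolean\",\"default\":true}]}");

        public static void main(String[] args) throws IOException {
            int schemaId = registerSchema(SCHEMA); // one repository round trip, then reuse

            GenericRecord coffee = new GenericData.Record(SCHEMA);
            coffee.put("brand", "Example Roasters");
            coffee.put("ounces", 12.0f);
            coffee.put("caffeinated", true);

            // Tag the Avro payload with the schema ID: [4-byte ID][Avro binary body].
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(ByteBuffer.allocate(4).putInt(schemaId).array());
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(SCHEMA).write(coffee, encoder);
            encoder.flush();

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("coffee", out.toByteArray()));
            }
        }

        // Hypothetical stand-in for the repository's REST call: POST schema, get ID.
        static int registerSchema(Schema schema) { return 1; }
    }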
Avro Data Serialization
- Expressive schemas
- Efficient, compact binary serialization
  - Fields are not tagged: more compact, potentially faster (byte-count sketch below)
  - Schema as written must be known at read time
- Code generation is optional
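The untagged encoding is easy to see with the generic API (no code generation): serialize one Coffee record and count the bytes. A sketch reusing the producer's Coffee schema; the field values are arbitrary:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class UntaggedBytes {
        public static void main(String[] args) throws IOException {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Coffee\",\"fields\":["
                + "{\"name\":\"brand\",\"type\":\"string\",\"default\":\"\"},"
                + "{\"name\":\"ounces\",\"type\":\"float\"},"
                + "{\"name\":\"caffeinated\",\"type\":\"boolean\",\"default\":true}]}");

            GenericRecord coffee = new GenericData.Record(schema);
            coffee.put("brand", "Mocha");
            coffee.put("ounces", 12.0f);
            coffee.put("caffeinated", true);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(coffee, encoder);
            encoder.flush();

            // No field names or tags in the output: 1 length byte + 5 UTF-8 bytes
            // for the string, 4 bytes for the float, 1 byte for the boolean = 11.
            System.out.println(out.toByteArray().length + " bytes");
        }
    }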
Schema Resolution
Read data written with schema A using a compatible schema B (sketched below):
- Ignore data in A not specified in B
- Apply default values to fields in B not present in A
- Promote types (e.g. int to long, float to double)
- Reorder fields
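A sketch of resolution with the Avro Java API: GenericDatumReader takes both the writer's and the reader's schema and applies these rules during decoding. The two Coffee variants here are illustrative:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class Resolution {
        static final Schema WRITER = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Coffee\",\"fields\":["
            + "{\"name\":\"brand\",\"type\":\"string\",\"default\":\"\"},"
            + "{\"name\":\"ounces\",\"type\":\"float\"}]}");

        // Reader drops `brand`, widens `ounces` to double, adds a defaulted field.
        static final Schema READER = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Coffee\",\"fields\":["
            + "{\"name\":\"ounces\",\"type\":\"double\"},"
            + "{\"name\":\"countryOfOrigin\",\"type\":\"string\",\"default\":\"\"}]}");

        public static void main(String[] args) throws IOException {
            GenericRecord written = new GenericData.Record(WRITER);
            written.put("brand", "Mocha");
            written.put("ounces", 12.0f);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(WRITER).write(written, enc);
            enc.flush();

            // The reader resolves between the two schemas: `brand` is skipped,
            // `ounces` is promoted float -> double, `countryOfOrigin` gets its default.
            Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            GenericRecord read =
                new GenericDatumReader<GenericRecord>(WRITER, READER).read(null, dec);
            System.out.println(read); // {"ounces": 12.0, "countryOfOrigin": ""}
        }
    }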
Avro Schema Repository
- Maps Schemas to Schema IDs
- Validates Schema Compatibility (toy sketch below)
- REST interface
- See AVRO-1124
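The repository's two duties, mapping and validation, can be sketched without the REST layer. This toy in-memory version hands out sequential IDs and rejects any schema that cannot read data written with a previously registered one (one possible validation rule), using Avro's SchemaCompatibility helper; the real AVRO-1124 service is not modeled here:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaCompatibility;
    import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

    public class InMemorySchemaRepo {
        private final List<Schema> byId = new ArrayList<>();

        public synchronized int register(Schema candidate) {
            // Reject the candidate unless it can read data written with every
            // schema registered so far (backward compatibility, one rule choice).
            for (int id = 0; id < byId.size(); id++) {
                SchemaCompatibilityType type = SchemaCompatibility
                    .checkReaderWriterCompatibility(candidate, byId.get(id))
                    .getType();
                if (type != SchemaCompatibilityType.COMPATIBLE) {
                    throw new IllegalArgumentException(
                        "candidate cannot read data written with schema id " + id);
                }
            }
            byId.add(candidate);
            return byId.size() - 1; // the new schema's ID
        }

        public synchronized Schema lookup(int id) {
            return byId.get(id);
        }
    }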
Consumer

Consume data with schema:

    record Coffee {
        float ounces;
        boolean caffeinated = true;
        string countryOfOrigin = "";
    }

For each record from Kafka:
- Read the Schema ID tag
- Look up the schema in the Repository (cacheable)
- Generate an Avro schema reader for each reader-writer schema pair, as needed
- Read the payload with the appropriate reader (sketched below)

Result: every record is resolved to the Consumer's schema, no matter which schema version the Producer wrote it with.
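Those four steps as a sketch. It assumes the [4-byte ID][Avro payload] framing from the producer sketch; schemaById is any repository lookup (for instance the toy repository's lookup method above), and the cache holds one resolving reader per writer schema:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.IntFunction;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;

    public class TaggedDecoder {
        private final IntFunction<Schema> schemaById; // repository lookup, e.g. repo::lookup
        private final Schema readerSchema;            // the Consumer's own schema
        // One resolving reader per writer schema ID, built lazily and reused.
        private final Map<Integer, GenericDatumReader<GenericRecord>> readers = new HashMap<>();

        public TaggedDecoder(IntFunction<Schema> schemaById, Schema readerSchema) {
            this.schemaById = schemaById;
            this.readerSchema = readerSchema;
        }

        public GenericRecord decode(byte[] message) throws IOException {
            int schemaId = ByteBuffer.wrap(message).getInt(); // 1. read the Schema ID tag
            GenericDatumReader<GenericRecord> reader = readers.computeIfAbsent(
                schemaId, // 2. look up the writer schema (cached); 3. build the pair once
                id -> new GenericDatumReader<>(schemaById.apply(id), readerSchema));
            // 4. read the payload (everything after the 4-byte tag) with that reader.
            Decoder dec = DecoderFactory.get().binaryDecoder(message, 4, message.length - 4, null);
            return reader.read(null, dec);
        }
    }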
Consumers Decoupled from Producers
Compatible schema changes decouple Producers from Consumers:
- As new Producers come online or are upgraded, they can evolve the schema.
- Consumers do not have to change at the same time.
- Likewise, Consumers can change the schema they use to interpret the data with no Producer changes.
- Invalid schema evolution is caught by the Repository Server's validation.
- Choose your own schema and validation rules.