This document discusses Apache Pulsar schemas. It begins with background on Pulsar, serialization, and schema evolution. It then discusses the benefits of using schemas with Pulsar, including different schema types like primitive, JSON, and Avro schemas. It describes how Pulsar uses a schema registry to store schemas on the server side rather than client side. Key learnings are to use structured schemas like Avro to model domain objects, consider compatibility and ordering when designing topics, and manage schemas through a code review process. The document provides references for further reading on Pulsar schemas and schema evolution.
9. Background - schema : serialization
Definitions
1. Imagine you have to send an employee record over the network.
2. You can't write it as is.
3. An employee encoder converts the employee record to a stream of bytes.
4. Formally, this is encoding / serialization.
5. Send the bytes over the network.
https://www.raywenderlich.com/books/swift-apprentice/v6.0/chapters/22-encoding-decoding-types
10. Background - schema : de-serialization
Definitions
1. When reading from the network, turn the stream of bytes back into an employee record.
2. A decoder converts the bytes to an employee instance.
3. Formally, this is decoding / de-serialization.
https://www.raywenderlich.com/books/swift-apprentice/v6.0/chapters/22-encoding-decoding-types
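The encode / decode round trip described in the two slides above can be sketched in a few lines of Python. This is an illustration only: the `Employee` fields and the byte layout (4-byte id, 8-byte salary, 2-byte name length, then the name) are assumptions made up for the example, not something the slides prescribe.

```python
import struct
from dataclasses import dataclass

HEADER = ">idH"  # assumed layout: big-endian int id, double salary, name length


@dataclass
class Employee:
    emp_id: int
    name: str
    salary: float


def encode(emp: Employee) -> bytes:
    """Serialize: fixed-size header followed by the UTF-8 encoded name."""
    name_bytes = emp.name.encode("utf-8")
    header = struct.pack(HEADER, emp.emp_id, emp.salary, len(name_bytes))
    return header + name_bytes


def decode(data: bytes) -> Employee:
    """Deserialize: read the fixed-size header, then the variable-length name."""
    header_size = struct.calcsize(HEADER)
    emp_id, salary, name_len = struct.unpack(HEADER, data[:header_size])
    name = data[header_size:header_size + name_len].decode("utf-8")
    return Employee(emp_id, name, salary)
```

Note that the decoder only works because it shares the layout with the encoder. That shared knowledge is the schema, whether or not it is written down anywhere.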
11. Background - schema : Schema?
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1. Encoding can be done with the native serialization of a programming language. Examples:
a. Java Serialization
b. Python’s pickle
c. Ruby’s Marshal
2. Locked in to one programming language - oops!
3. Maybe JSON or XML would work, like web APIs?
a. too verbose
b. stores keys over and over
c. no way to fix types; you guess types by looking at the data. Yuck!
4. Need to save space with each data instance.
5. Also, people stuff in random types that other people don't understand.
a. Document well?
6. OK, let's agree on some protocols and document well what's allowed and what's NOT.
7. Well, that is what Avro, Protobuf, Thrift etc. are!
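To make the "too verbose / storing keys over and over" point concrete, here is a small comparison of the same record encoded as JSON and encoded against an agreed layout. The record and the layout string are invented for the example, and the fixed `struct` layout is only a stand-in for a schema-based format. It is not the actual Avro, Protobuf, or Thrift wire encoding.

```python
import json
import struct

record = {"id": 42, "age": 30, "active": True}

# Self-describing JSON: the field names travel with every single message.
json_bytes = json.dumps(record).encode("utf-8")

# Agreed layout (the "schema"): 4-byte unsigned int, 1-byte unsigned int, 1-byte bool.
SCHEMA_LAYOUT = ">IB?"
binary_bytes = struct.pack(
    SCHEMA_LAYOUT, record["id"], record["age"], record["active"]
)

print(len(json_bytes), len(binary_bytes))
```

The binary form carries no keys and no type hints, so it is smaller, but it is unreadable without the layout. That is the trade-off a documented schema resolves.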
12. Background - schema : Evolution
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1. The schema is defined and documented. Great!
2. Someone wants to quickly add a new data type.
a. How does decoder know which schema to use: old or new?
b. Among all schemas how does decoder know two are connected?
i. That is schema versioning for you!
3. Avro, Protobuf, JSON Schema, Thrift etc. support schema evolution with versioning.
4. The sender (producer) and the reader (consumer) can have different versions of the schema at the same time.
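A rough sketch of how a producer and a consumer can be on different schema versions at the same time: the reader projects whatever was written onto its own schema, filling defaults for fields it expects but did not receive and dropping fields it does not know. This follows the general idea of Avro-style schema resolution, but the schema representation and the `resolve` helper below are hypothetical and heavily simplified.

```python
# Hypothetical schema versions for an "Employee" record. Each field has a name
# and an optional default that is used when the writer did not supply a value.
SCHEMA_V1 = [{"name": "id"}, {"name": "name"}]
SCHEMA_V2 = [
    {"name": "id"},
    {"name": "name"},
    {"name": "email", "default": "unknown"},
]


def resolve(record: dict, reader_schema: list) -> dict:
    """Project a record written with some schema onto the reader's schema."""
    out = {}
    for field in reader_schema:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    return out  # fields unknown to the reader are dropped
```

With this, `resolve({"id": 1, "name": "Ada"}, SCHEMA_V2)` lets a new consumer read old data, and `resolve({"id": 1, "name": "Ada", "email": "a@x.io"}, SCHEMA_V1)` lets an old consumer read new data.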
13. Background - schema : Avro
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1. Encode data with a schema.
2. Ship schema to consumer(?)
a. or Keep schema in a central place keyed with schemaId.
b. Ship schemaId with binary message.
3. While decoding:
a. Get schemaId from beginning of message (always Long?)
b. Fetch schema by schemaId from central schema store.
4. Decode keeping schema and binary data together.
5. Example: Schema tells decoder to expect 4 bytes to convert to an int
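The "ship schemaId with the binary message" flow can be sketched as follows. The in-memory `SCHEMA_STORE`, the layouts in it, and the choice of an 8-byte long for the id prefix are all assumptions for illustration (the slide itself leaves the id width as an open question).

```python
import struct

# Hypothetical central schema store: schemaId -> struct layout of the payload.
SCHEMA_STORE = {
    1: ">i",   # a single 4-byte signed int
    2: ">id",  # a 4-byte int followed by an 8-byte double
}

ID_PREFIX = ">q"  # assumption: the schemaId is an 8-byte signed long
ID_SIZE = struct.calcsize(ID_PREFIX)


def encode_message(schema_id: int, *values) -> bytes:
    """Encode the values with the stored schema and prefix the schemaId."""
    payload = struct.pack(SCHEMA_STORE[schema_id], *values)
    return struct.pack(ID_PREFIX, schema_id) + payload


def decode_message(message: bytes) -> tuple:
    """Read the schemaId, fetch the schema, then decode the remaining bytes."""
    (schema_id,) = struct.unpack(ID_PREFIX, message[:ID_SIZE])
    layout = SCHEMA_STORE[schema_id]
    return struct.unpack(layout, message[ID_SIZE:])
```

For schema 1 the decoder learns from the store that the payload is exactly 4 bytes to be read as an int, which is the situation described in item 5.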
16. Schema : no schema?
[Diagram: BYTES -> APACHE PULSAR -> BYTES]
1. Your data in the Pulsar store is plain binary (0s and 1s).
2. Pulsar supports several schema types for encoding & decoding.
3. Data can be encoded using a schema.
4. Data can be decoded given the schema & the binary data.
17. Schema : no schema?
[Diagram: BYTES -> APACHE PULSAR -> BYTES]
1. Schema or no schema?
2. How do you encode / decode the bytes of Pulsar data?
3. If you don’t have a schema, your schema is implicit in your app code!
19. Schema : no schema?
https://martinfowler.com/articles/schemaless/#non-uniform-types
20. Schema : no schema?
Add custom fields for UI etc.
Different attributes depending on kind of event.
Obviously, easy for schemaless, still needs care!
https://martinfowler.com/articles/schemaless/#non-uniform-types
31. Introducing Pulsar Schema : Structs
(AVRO schema)
Application “knows” which types go to which topic.
32. Pulsar Schema : Schema Store
(Client side)
1. In the previous examples, the schema was stored in the producer and consumer objects.
2. This is the client-side schema storage approach.
33. Pulsar Schema : Schema Store
(Client side)
Problems:
1. Client responsible for:
a. “serializing” data objects (user instance) into bytes
b. “de-serializing” bytes to data object (user instance)
c. “knowing” which types go to which topic.
2. With consumers spread across several microservices, “knowing” and “evolving” the schema is challenging!
34. Pulsar Schema : Schema Store
(Server side)
Solution:
1. Store schema on a central server.
2. When producing, upload schema to central server.
3. Add schemaId (Long) to message.
4. When consuming, fetch schema with schemaId.
5. Schema management server manages evolution (versioning).
Pulsar has a built-in schema registry service!
37. Pulsar Schema : Schema Registry
1. Each schemaInfo stored with a topic has a version.
2. SchemaVersion tracks schema changes happening within a topic.
3. Messages produced with a schemaInfo are tagged with its version.
4. A consumer can use the schemaVersion to fetch the schemaInfo, then decode the message with that schemaInfo.
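The per-topic versioning described above can be pictured with a tiny in-memory registry. This is only a mental model: the `SchemaRegistry` class, its method names, and the use of a list index as the version are hypothetical and do not reflect how the Pulsar broker stores schemas internally.

```python
class SchemaRegistry:
    """In-memory stand-in for a server-side schema registry (illustrative only)."""

    def __init__(self):
        # topic -> list of schema definitions; the list index is the version
        self._schemas = {}

    def register(self, topic: str, schema: dict) -> int:
        """Store a schema for a topic and return its version.

        Re-registering an identical schema returns the existing version.
        """
        versions = self._schemas.setdefault(topic, [])
        if schema in versions:
            return versions.index(schema)
        versions.append(schema)
        return len(versions) - 1

    def get(self, topic: str, version: int) -> dict:
        """Fetch the schema that a message tagged with `version` was written with."""
        return self._schemas[topic][version]
```

A producer would call `register` once and tag each message with the returned version. A consumer would call `get` with the version found on the message and decode with the result.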
41. Learnings
1. Struct schemas (JSON, Avro, Protobuf) model domain objects well.
2. Use the bytes schema only if really needed.
3. We have been using Avro schemas with Pulsar in production for over a year:
a. JSON schema is too verbose.
b. Protobuf is awesome, but still being adopted among sources / sinks.
c. Avro saves data per message compared with JSON schema.
d. Avro is very well adopted among sources / sinks.
42. Learnings
1. Always a good idea to think hard and set compatibility on the namespace.
2. Decide on compatibility depending on the use case and expected evolution.
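To get a feel for what a compatibility setting enforces, here is a deliberately simplified backward-compatibility check: a consumer on the new schema must be able to read data written with the old one. The field representation and the `is_backward_compatible` helper are assumptions for this sketch. Real checkers (for example for Avro) apply richer rules, such as allowing certain type promotions.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """True if a reader on the new schema can read data written with the old one.

    Each argument maps field name -> {"type": ..., optional "default": ...}.
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                return False  # old data has no value for this new field
        elif old_fields[name]["type"] != spec["type"]:
            return False  # simplification: no type promotion allowed
    return True  # removed fields are fine: the new reader just ignores them
```

Under this rule, adding a field with a default or removing a field passes, while adding a required field or changing a field's type is rejected.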
43. Learnings
1. Prefer ordering requirements to decide what goes on which topic.
a. One domain => one topic!
b. Use AUTO_CONSUME on consumer schema type.
c. schema-autoupdate-strategy = NONE.
2. The schema management process that works for us is:
a. Keep a GitHub repo with schemas.
b. Use code reviews to review schema changes.
c. Generate POJOs from Avro using a Maven plugin (Java).
d. Add the POJO library as a dependency to microservices to import domain objects.