Pulsar Summit Asia - Structured Data Stream with Apache Pulsar

#PulsarSummit Asia 2020
Structured Data Stream with Pulsar
Shivji Kumar Jha

Who am I?
https://www.linkedin.com/in/shivjijha/
https://twitter.com/ShivjiJha

Catalogue
• Background: Apache Pulsar
• Background: Schema
• Why Schema
• Introducing Pulsar Schema
• Learnings
• Q&A

Background: Apache Pulsar
Pulsar: cloud-native, distributed messaging and streaming platform

APACHE PULSAR

Highlights:
1. Modular design
2. Horizontally scalable
3. Low latency with durability
4. Multi-tenancy
5. Geo Replication

Background - schema : serialization
Definitions
1. Imagine you have to send an employee record over the network.
2. You can't write the in-memory object to the wire as is.
3. An employee encoder converts the employee record to a stream of bytes.
4. Formally, this is encoding / serialization.
5. Send the bytes over the network.
https://www.raywenderlich.com/books/swift-apprentice/v6.0/chapters/22-encoding-decoding-types

Background - schema : de-serialization
Definitions
1. When reading from the network, turn the stream of bytes back into an employee record.
2. A decoder converts the bytes to an employee instance.
3. Formally, this is decoding / de-serialization.
https://www.raywenderlich.com/books/swift-apprentice/v6.0/chapters/22-encoding-decoding-types
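The encode and decode steps from the two slides above can be sketched in plain Java. This is a minimal illustration, not code from the talk: the Employee class, its two fields and the length-prefixed layout are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical domain object, used only for this illustration.
final class Employee {
    final String name;
    final int age;

    Employee(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

final class EmployeeCodec {
    // Encoding / serialization: employee record -> bytes.
    // Assumed layout: 4-byte name length, UTF-8 name bytes, 4-byte age.
    static byte[] encode(Employee e) {
        byte[] name = e.name.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + name.length + 4)
                .putInt(name.length)
                .put(name)
                .putInt(e.age)
                .array();
    }

    // Decoding / de-serialization: bytes -> employee record.
    // The decoder must read fields in exactly the order the encoder wrote them.
    static Employee decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] name = new byte[buf.getInt()];
        buf.get(name);
        int age = buf.getInt();
        return new Employee(new String(name, StandardCharsets.UTF_8), age);
    }

    public static void main(String[] args) {
        byte[] wire = encode(new Employee("Asha", 31));
        Employee copy = decode(wire);
        System.out.println(copy.name + ", " + copy.age + ", " + wire.length + " bytes on the wire");
    }
}
```

Note that the field order and types live only in the codec. That hidden agreement between encoder and decoder is what a schema makes explicit.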

Background - schema : Schema?
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1. Encoding can be done with the native serialization of a programming language. Examples:
a. Java Serialization
b. Python’s pickle
c. Ruby’s marshal
2. That locks you in to one programming language. Oops!
3. Maybe JSON or XML would work, as they do for web APIs?
a. Too verbose.
b. Keys are stored over and over.
c. No way to fix types; you guess types by looking at the data. Yuck!
4. Need to save space with each data instance.
5. Also, people stuff in random types which other people don't understand.
a. Document well?
6. OK, let's agree on some protocols and document well what's allowed and what's NOT.
7. Well, that is what Avro, Protobuf, Thrift etc. are!
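To make the "too verbose" and "storing keys over and over" points concrete, here is a small sketch comparing a self-describing JSON encoding with a schema-based binary one for the same record. The record fields, the hand-built JSON string (no escaping) and the binary layout are assumptions for illustration only; real formats such as Avro use their own encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class EncodingSizeDemo {
    // Self-describing text: the field names travel with every single record.
    // Hand-built JSON for illustration; no string escaping is performed.
    static byte[] asJson(String name, int age) {
        String json = "{\"name\":\"" + name + "\",\"age\":" + age + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // Schema-based binary: field names, order and types are agreed up front,
    // so only the values (plus a length prefix for the string) are written.
    static byte[] asBinary(String name, int age) {
        byte[] n = name.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + n.length + 4)
                .putInt(n.length)
                .put(n)
                .putInt(age)
                .array();
    }

    public static void main(String[] args) {
        System.out.println("JSON:   " + asJson("Asha", 31).length + " bytes");
        System.out.println("binary: " + asBinary("Asha", 31).length + " bytes");
    }
}
```

The size gap depends on the data, but the JSON cost of repeating key names is paid on every message, while the schema is defined once.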

Background - schema : Evolution
1. The schema is defined and documented. Great!
2. Someone wants to quickly add a new data type.
a. How does the decoder know which schema to use: old or new?
b. Among all schemas, how does the decoder know that two are connected?
i. That is schema versioning for you!
3. Avro, Protobuf, JSON Schema, Thrift etc. support schema evolution with versioning.
4. The sender (producer) and reader (consumer) can use different versions of the schema at the same time.
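A rough sketch of what versioning buys you: an old producer writes version 1 of a record, a new producer writes version 2 with an extra field, and a single consumer decodes both by looking at the version tag and filling in a default. The one-byte version tag, the field names and the default value are hypothetical; this is not how Avro or Pulsar lay out bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class EvolvingEmployeeCodec {
    // Default used when reading v1 data that predates the "department" field.
    static final String DEFAULT_DEPARTMENT = "unknown";

    private static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    private static String readString(ByteBuffer buf) {
        byte[] b = new byte[buf.getInt()];
        buf.get(b);
        return new String(b, StandardCharsets.UTF_8);
    }

    // Old producer, schema version 1: (name, age).
    static byte[] encodeV1(String name, int age) {
        byte[] n = utf8(name);
        return ByteBuffer.allocate(1 + 4 + n.length + 4)
                .put((byte) 1).putInt(n.length).put(n).putInt(age).array();
    }

    // New producer, schema version 2: (name, age, department).
    static byte[] encodeV2(String name, int age, String department) {
        byte[] n = utf8(name);
        byte[] d = utf8(department);
        return ByteBuffer.allocate(1 + 4 + n.length + 4 + 4 + d.length)
                .put((byte) 2).putInt(n.length).put(n).putInt(age)
                .putInt(d.length).put(d).array();
    }

    // Consumer on the latest schema: the version tag says how the bytes were written.
    static Map<String, Object> decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int version = buf.get();
        if (version != 1 && version != 2) {
            throw new IllegalArgumentException("unknown schema version " + version);
        }
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("name", readString(buf));
        record.put("age", buf.getInt());
        record.put("department", version >= 2 ? readString(buf) : DEFAULT_DEPARTMENT);
        return record;
    }

    public static void main(String[] args) {
        System.out.println(decode(encodeV1("Asha", 31)));
        System.out.println(decode(encodeV2("Ravi", 40, "platform")));
    }
}
```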

Background - schema : Avro
1. Encode data with a schema.
2. Ship schema to consumer(?)
a. or Keep schema in a central place keyed with schemaId.
b. Ship schemaId with binary message.
3. While decoding:
a. Get schemaId from beginning of message (always Long?)
b. Fetch schema by schemaId from central schema store.
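4. Decode using the schema and the binary data together.
5. Example: the schema tells the decoder to read the next bytes as an int (a fixed 4 bytes in a simple encoding; Avro itself uses a variable-length encoding for ints).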
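The schemaId approach above could look roughly like this. The sketch assumes an 8-byte schemaId prefix and uses an in-memory map as a stand-in for the central schema store; the schema "definitions" are just type names, and ints are written as a fixed 4 bytes for simplicity rather than in Avro's real encoding.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

final class SchemaIdFraming {
    // Stand-in for a central schema store: schemaId -> schema (here just a type name).
    static final Map<Long, String> SCHEMA_STORE = new HashMap<>();

    // Producer side: ship the schemaId at the beginning of the binary message.
    static byte[] encodeInt(long schemaId, int value) {
        return ByteBuffer.allocate(Long.BYTES + Integer.BYTES)
                .putLong(schemaId)
                .putInt(value)
                .array();
    }

    // Consumer side: read the schemaId, fetch the schema, then let the schema
    // decide how the remaining bytes are interpreted.
    static Object decode(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        long schemaId = buf.getLong();
        String schema = SCHEMA_STORE.get(schemaId);
        if (schema == null) {
            throw new IllegalStateException("unknown schemaId " + schemaId);
        }
        switch (schema) {
            case "int":
                return buf.getInt();   // schema says: next 4 bytes are an int
            case "long":
                return buf.getLong();  // schema says: next 8 bytes are a long
            default:
                throw new IllegalArgumentException("unsupported schema " + schema);
        }
    }

    public static void main(String[] args) {
        SCHEMA_STORE.put(7L, "int");
        System.out.println(decode(encodeInt(7L, 42)));
    }
}
```

Without the store lookup, the same payload bytes could be read as an int, half a long, or anything else; the schema is what removes the guesswork.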

Schema : no schema?
[Diagram: bytes flow into and out of Apache Pulsar]
1. Your data in the Pulsar store is plain binary (0s and 1s).
2. Pulsar supports several schema types for encoding & decoding.
3. You can encode data using a schema.
4. You can decode data given the schema & the binary data.

Schema : no schema?
1. Schema or no schema?
2. How do you encode / decode the bytes of Pulsar data?
3. If you don’t have a schema, your schema is implicit in your app code!

Schema : no schema?
Annotations on the non-uniform types example:
• Add custom fields for UI etc.
• Different attributes depending on kind of event.
• Obviously easy for schemaless, but it still needs care!
https://martinfowler.com/articles/schemaless/#non-uniform-types

Introducing Pulsar Schema : bytes
[Code slides not captured: domain object; byte schema serialized with Java]

Introducing Pulsar Schema : String
[Code slides not captured: producer; consumer]

Introducing Pulsar Schema : All Primitive types
[Slides not captured]

Introducing Pulsar Schema : Structs
(JSON schema)
[Code slides not captured: domain object; producer with JSON schema serialization]
(AVRO schema)
[Code slides not captured]
Application “knows” which types go to which topic.

Pulsar Schema : Schema Store
(Client side)
1. In the previous examples, the schema was held in the producer and consumer objects.
2. This is the client-side schema storage approach.

(Client side)
Problems:
1. Client responsible for:
a. “serializing” data objects (user instance) into bytes
b. “de-serializing” bytes to data object (user instance)
c. “knowing” which types go to which topic.
2. With consumers spread across several microservices, “knowing” and “evolving” the schema is challenging!

(Server side)
Solution:
1. Store schema on a central server.
2. When producing, upload schema to central server.
3. Add schemaId (Long) to message.
4. When consuming, fetch schema with schemaId.
5. Schema management server manages evolution (versioning).
Pulsar has a built-in schema registry service!

Pulsar Schema : Schema Registry
1. The entity stored by the schema registry service is schemaInfo.

1. Each schemaInfo stored with a topic has a version.
2. SchemaVersion manages the schema changes happening within a topic.
3. Each message produced with a schemaInfo is tagged with its schema version.
4. A consumer can use the schemaVersion to fetch the schemaInfo, then decode the message with that schemaInfo.
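The per-topic versioning described above can be modelled with a tiny in-memory registry. This is a sketch under assumptions, not Pulsar's implementation: the class name, the string schema definitions and the choice to number versions from 0 are all made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a schema registry keyed by topic.
final class TopicSchemaRegistry {
    // topic -> schema definitions, where the list index is the schema version.
    private final Map<String, List<String>> schemasByTopic = new HashMap<>();

    // Registers a schema for a topic and returns its version.
    // Registering an identical definition again returns the existing version.
    synchronized long register(String topic, String schemaDefinition) {
        List<String> versions = schemasByTopic.computeIfAbsent(topic, t -> new ArrayList<>());
        int existing = versions.indexOf(schemaDefinition);
        if (existing >= 0) {
            return existing;
        }
        versions.add(schemaDefinition);
        return versions.size() - 1;
    }

    // Consumer side: use the version carried by a message to fetch its schema.
    synchronized String fetch(String topic, long version) {
        List<String> versions = schemasByTopic.get(topic);
        if (versions == null || version < 0 || version >= versions.size()) {
            throw new IllegalArgumentException("no schema version " + version + " for topic " + topic);
        }
        return versions.get((int) version);
    }

    public static void main(String[] args) {
        TopicSchemaRegistry registry = new TopicSchemaRegistry();
        long v0 = registry.register("employees", "name:string,age:int");
        long v1 = registry.register("employees", "name:string,age:int,department:string");
        System.out.println("versions " + v0 + " and " + v1 + "; v1 = " + registry.fetch("employees", v1));
    }
}
```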

Schema payload structure:
SchemaType schemaType;
Boolean isDeleted;
Long timestamp;
String user;
byte[] data;
HashMap<String, String> props;

Admin CLI commands and REST APIs are available to manage schemas.

Learnings
1. Struct schemas (JSON, Avro, Protobuf) model domain objects well.
2. Use byte schema only if really needed.
3. We have been using Avro schemas with Pulsar for over a year in production.
a. JSON schema is too verbose.
b. Protobuf is awesome, but is still being adopted by sources / sinks.
c. Avro saves space per message compared with JSON schema.
d. Avro is very well adopted among sources / sinks.

Learnings
1. It is always a good idea to think hard and set the compatibility strategy on the namespace.
2. Decide on compatibility depending on the use-case and the expected evolution.
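As a rough aid for thinking about compatibility, the sketch below checks two schema versions for backward and forward compatibility. It is a deliberately simplified model, assuming a schema is just a map from field name to "has a default value"; it ignores type changes, renames and the other rules a real checker applies, and all names in it are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class CompatibilityCheck {
    // A reader schema can decode a writer's data if every field the reader
    // expects is either present in the writer schema or has a default.
    static boolean canRead(Map<String, Boolean> reader, Map<String, Boolean> writer) {
        for (Map.Entry<String, Boolean> field : reader.entrySet()) {
            boolean hasDefault = field.getValue();
            if (!writer.containsKey(field.getKey()) && !hasDefault) {
                return false;
            }
        }
        return true;
    }

    // Backward: consumers on the NEW schema can read data written with the OLD one.
    static boolean isBackward(Map<String, Boolean> oldSchema, Map<String, Boolean> newSchema) {
        return canRead(newSchema, oldSchema);
    }

    // Forward: consumers still on the OLD schema can read data written with the NEW one.
    static boolean isForward(Map<String, Boolean> oldSchema, Map<String, Boolean> newSchema) {
        return canRead(oldSchema, newSchema);
    }

    public static void main(String[] args) {
        Map<String, Boolean> v1 = new HashMap<>();
        v1.put("name", false);
        v1.put("age", false);

        Map<String, Boolean> v2 = new HashMap<>(v1);
        v2.put("department", true);   // new field WITH a default

        Map<String, Boolean> v3 = new HashMap<>(v1);
        v3.put("email", false);       // new field WITHOUT a default

        System.out.println("v1 -> v2 backward: " + isBackward(v1, v2));
        System.out.println("v1 -> v3 backward: " + isBackward(v1, v3));
    }
}
```

Whether you need backward, forward or both depends on who upgrades first: consumers or producers.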

Learnings
1. Let ordering requirements decide what goes on which topic.
a. One domain => one topic!
b. Use AUTO_CONSUME as the consumer schema type.
c. schema-autoupdate-strategy = NONE.
2. The schema management process that works for us is:
a. Keep a GitHub repo with schemas.
b. Use code reviews to review schema changes.
c. Generate POJOs from Avro using a Maven plugin (Java).
d. Add the POJO library as a dependency to microservices to import domain objects.

References
1. Pulsar docs: https://pulsar.apache.org/docs/en/schema-get-started/
2. Schema auto update strategy: https://pulsar.apache.org/docs/en/pulsar-admin/#set-schema-autoupdate-strategy
3. Schema Evolution in Avro, Thrift, Protobuf: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
4. Topic design per domain: https://www.confluent.io/blog/put-several-event-types-kafka-topic/
5. Schema Compatibility Design: https://docs.confluent.io/platform/current/schema-registry/avro.html#compatibility-types

Staying Connected:
○ https://twitter.com/ShivjiJha
○ https://www.linkedin.com/in/shivjijha/
Q & A
