www.scling.com
Schema management with Scalameta
Func prog Sweden Meetup, 2023-10-10
Lars Albertsson
Scling
1
Who’s talking?
...
Google - video conference, productivity
...
Spotify - data engineering
...
Independent data engineering consultant
Banks, media, startups, heavy industry, telco
Founder @ Scling - data-factory-as-a-service
2
100x 100x
Data artifacts produced
Manual, mechanised, industrialised
3
Spotify's pipelines ~2013
Craft vs industry
4
Craft:
● Each step steered by human
○ Or primitive automation
● Improving artifacts
● Craft is primary competence
● Components made for humans
○ Look nice, "easy to use"
○ Most popular
Industry:
● Autonomous processes
● Improving process that creates artifacts
● Multitude of competences
● Some components unusable by humans
○ Hard, greasy
○ Made for integration
○ Less popular
● Processes often include craft steps
Road towards industrialisation
5
Data warehouse (DW) age - mechanised analytics
LAMP stack age - manual analytics
Hadoop age - industrialised analytics, data-fed features, machine learning
Significant change in workflows
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
Simplifying use of new technology
6
DW
Enterprise big data failures
"Modern data stack" -
traditional workflows, new technology
Low-code, no-code
We have seen this before
7
Difficult adoption
4GL, UML, low-code, no-code
Software engineering education
Data engineering in the future
8
DW
~10 year capability gap
"data factory engineering"
Enterprise big data failures
"Modern data stack" -
traditional workflows, new technology
4GL / UML phase of data engineering
Data engineering education
Value of data factories
● Factory value ~ robot movements.
● Data factory value ~ number of datasets.
● Differences are orders of magnitude
○ Enterprises: 100s / day
○ Spotify: 100Ks / day
○ Google: Bs / day
● Cost of DataOps - value of data
○ Low cost - further out the long tail
9
Disruptive value of data, ML
Traditional data warehouse reporting
Data-factory-as-a-service
10
Data lake
● Data factory
○ Collected, raw data → processed, valuable data
● Data pipelines customised for client
○ Analytics (BI, reports, A/B testing)
○ Data-fed features (autocomplete, search)
○ Learning systems (recommendations, fraud)
● Compete with data leaders:
○ Quick idea-to-production
○ Operational efficiency
Success in data factories
11
● Work driven by use cases
○ Teams aligned along the value chain
● Minimal data innovation friction
○ Data democratised - accessible and usable
○ Quick code / test / debug / deploy feedback cycle
● High value / non-value work ratio
○ Guard rails to maintain speed without risk
■ Dev tooling, tests, quality metrics
○ Minimal operational toil
● Build on software engineering processes
○ Composability
○ DevOps, everything as code
○ Strong CI/CD process
Data lake
Transformation
Cold
pond
13
Mutation
Immutable,
shareable
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
DataOps, functional workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
Data factories
What is a schema?
14
● Lowest common denominator = name, type, required
○ Types: string, long, double, binary, array, map, union, record
● Schema specification may support additional constraints, e.g. integer range, other collections
● In RDBMS, relations are explicit
● In lake/stream datasets, relations are implicit

Fields: name, type, required?

Id  Name    Age  Phone
1   "Anna"  34   null
2   "Bob"   42   "08-123456"
Schema definitions
15
{
  "type" : "record",
  "namespace" : "com.mapflat.example",
  "name" : "User",
  "fields" : [
    { "name" : "id" , "type" : "int" },
    { "name" : "name" , "type" : "string" },
    { "name" : "age" , "type" : "int" },
    { "name" : "phone" , "type" : ["null", "string"],
      "default": null }
  ]
}
● RDBMS: Table metadata
● Avro: JSON/DSL definition
○ Definition is bundled with avro data files
● Parquet
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
{ "id": 1, "name": "Alice", "age": 34 }
{ "id": 2, "name": "Bob", "age": 42, "phone": "08-123456" }
case class User(id: Int, name: String, age: Int,
                phone: Option[String] = None)
val users = Seq(User(1, "Alice", 34),
                User(2, "Bob", 42, Some("08-123456")))
Avro serialisation
16
{
  "type" : "record",
  "namespace" : "com.mapflat.example",
  "name" : "User",
  "fields" : [
    { "name" : "Id" , "type" : "int" },
    { "name" : "Name" , "type" : "string" },
    { "name" : "Age" , "type" : "int" },
    { "name" : "Phone" , "type" : ["null", "string"],
      "default": null }
  ]
}
{ "Id": 1, "Name": "Alice", "Age": 34 }
{ "Id": 2, "Name": "Bob", "Age": 42, "Phone": "08-123456" }
User(1, "Alice", 34)
User(2, "Bob", 42, Some("08-123456"))
Write JSON
Write binary
1001001110101010010101010010111010101010100100111101010
1010101001010101011010101010110111010101010100110010101
Read JSON
Read binary
requires schema
User(1, "Alice", 34)
User(2, "Bob", 42, Some("08-123456"))
Avro in files
17
{
  "type" : "record",
  "namespace" : "com.mapflat.example",
  "name" : "User",
  "fields" : [
    { "name" : "Id" , "type" : "int" },
    { "name" : "Name" , "type" : "string" },
    { "name" : "Age" , "type" : "int" },
    { "name" : "Phone" , "type" : ["null", "string"],
      "default": null }
  ]
}
{ "Id": 1, "Name": "Alice", "Age": 34 }
{ "Id": 2, "Name": "Bob", "Age": 42, "Phone": "08-123456" }
(JSON, line
separated) Schema header +
binary records
1001001110101010010101010010111010101010100100111101010
1010101001010101011010101010110111010101010100110010101
User(1, "Alice", 34)
User(2, "Bob", 42, Some("08-123456"))
Schema on write
18
● Schema defined by writer
● Destination (table / dataset / stream topic) has defined schema
○ Technical definition with metadata (e.g. RDBMS, Kafka + registry)
○ By convention
● Writes not in compliance are not accepted
○ Technically aborted (e.g. RDBMS)
○ In violation of intent (e.g. HDFS datasets)
● Can be technically enforced by producer driver
○ Through ORM / code generation
○ Schema registry lookup
Strict checking philosophy
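The strict philosophy can be sketched in plain Scala: the destination owns a schema, and a write that does not comply is rejected. This is a minimal illustration only; `Field`, `Schema`, `validate` and the two field types are hypothetical names, not part of any library mentioned in the talk.

```scala
object SchemaOnWriteSketch {
  // Hypothetical, minimal schema model: name, type, required.
  sealed trait FieldType
  case object StringType extends FieldType
  case object IntType extends FieldType

  final case class Field(name: String, tpe: FieldType, required: Boolean)
  final case class Schema(fields: List[Field])

  type Record = Map[String, Any]

  private def conforms(tpe: FieldType, value: Any): Boolean = (tpe, value) match {
    case (StringType, _: String) => true
    case (IntType, _: Int)       => true
    case _                       => false
  }

  // Returns the record if it complies, otherwise the list of violations.
  def validate(schema: Schema, record: Record): Either[List[String], Record] = {
    val known = schema.fields.map(_.name).toSet
    val fieldErrors = schema.fields.flatMap { f =>
      record.get(f.name) match {
        case None if f.required            => Some(s"missing required field '${f.name}'")
        case None                          => None
        case Some(v) if conforms(f.tpe, v) => None
        case Some(v)                       => Some(s"field '${f.name}' has wrong type: $v")
      }
    }
    val unknownErrors =
      (record.keySet -- known).toList.sorted.map(k => s"unknown field '$k'")
    val errors = fieldErrors ++ unknownErrors
    if (errors.isEmpty) Right(record) else Left(errors)
  }

  val userSchema: Schema = Schema(List(
    Field("id", IntType, required = true),
    Field("name", StringType, required = true),
    Field("phone", StringType, required = false)
  ))
}
```

A writer would call `validate` before appending and abort on `Left`, which mirrors the "technically aborted" case above.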
Schema on read
19
● Anything (technically) accepted when writing
● Schema defined by reader, at consumption
○ When joining or filtering, unknown fields go through
● Reader may impose requirements on type & value
● Violations of constraints are detected at read
○ Perhaps long after production?
○ By team not owning producer?
Loose checking philosophy
hdfs dfs -cat part-00000.json | jq -c '. | select(.country == "SE")'
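The same loose philosophy, sketched in plain Scala with records as untyped maps. The reader only inspects the fields it cares about, unknown fields pass through, and violations surface at read time. `swedish` and `age` are hypothetical helper names chosen for this illustration.

```scala
object SchemaOnReadSketch {
  type Record = Map[String, Any]

  // Counterpart of the jq filter: only `country` is inspected,
  // all other fields go through untouched.
  def swedish(records: Seq[Record]): Seq[Record] =
    records.filter(_.get("country").contains("SE"))

  // Reader-side requirement on type and value, checked only at consumption.
  def age(record: Record): Either[String, Int] =
    record.get("age") match {
      case Some(a: Int) if a >= 0 => Right(a)
      case Some(other)            => Left(s"invalid age: $other")
      case None                   => Left("age missing")
    }
}
```

A record written with `"age": "34"` is accepted at write time and only fails when some consumer calls `age`, possibly long after production.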
Dynamic vs static typing
20
Schema on write Schema on read
Static typing Dynamic typing
Strict Loose
Possible
Java:
user.setName("Alice");
user2.getName();
Scala:
user = User(name = "Alice", ...)
user2.name
Java:
user.set("name", "Alice");
user2.get("name");
Python:
user.name = "Alice"
user2.name
Schema on read or write?
21
DB
DB
DB
Service
Service
Export
Business
intelligence
Change agility important here
Production stability important here
Compatibility
22
● Backward: new consumer can read data from old producer (= old datasets)
○ E.g. add optional field
● Forward: old consumer can read data from new producer
○ E.g. remove optional field
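Both directions reduce to one question: can a reader with schema A decode data written with schema B? The sketch below is a deliberate simplification (no type promotion, no aliases, no nested records), and `Field`, `canRead` and the two wrappers are hypothetical names, not Avro API.

```scala
object CompatibilitySketch {
  // Hypothetical, simplified field model; `hasDefault` marks optional fields.
  final case class Field(name: String, tpe: String, hasDefault: Boolean)
  type Schema = List[Field]

  // A reader can decode the data if every field it expects is either
  // present with the same type, or can be filled in from a default.
  def canRead(reader: Schema, writer: Schema): Boolean = {
    val written = writer.map(f => f.name -> f.tpe).toMap
    reader.forall { f =>
      written.get(f.name) match {
        case Some(tpe) => tpe == f.tpe
        case None      => f.hasDefault
      }
    }
  }

  // Backward: new consumer, old producer (old datasets).
  def backwardCompatible(newSchema: Schema, oldSchema: Schema): Boolean =
    canRead(reader = newSchema, writer = oldSchema)

  // Forward: old consumer, new producer.
  def forwardCompatible(newSchema: Schema, oldSchema: Schema): Boolean =
    canRead(reader = oldSchema, writer = newSchema)

  val v1: Schema = List(Field("id", "int", false), Field("name", "string", false))
  val v2: Schema = v1 :+ Field("phone", "string", hasDefault = true)  // add optional field
  val v3: Schema = v1 :+ Field("email", "string", hasDefault = false) // add required field
}
```

Under these assumptions, adding the optional `phone` field is both backward and forward compatible, while adding the required `email` field breaks backward compatibility because old datasets cannot supply it.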
Incompatible changes
23
● Jobs can accept old input formats
● Translate to new format internally
● Or rerun computations from raw data
● Dedicated job to translate old to new format - upcasting.
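A minimal sketch of upcasting, assuming a hypothetical incompatible change where `age` moves from a free-form string to an optional integer. `UserV1`, `UserV2` and `upcast` are illustrative names only.

```scala
import scala.util.Try

object UpcastSketch {
  // Old format: age was written as a free-form string.
  final case class UserV1(id: Int, name: String, age: String)
  // New format: age is numeric, and absent when it could not be parsed.
  final case class UserV2(id: Int, name: String, age: Option[Int])

  // Translate one old record to the new format.
  def upcast(old: UserV1): UserV2 =
    UserV2(old.id, old.name, Try(old.age.trim.toInt).toOption)

  // A dedicated upcasting job is then a map over the old dataset.
  def upcastAll(old: Seq[UserV1]): Seq[UserV2] = old.map(upcast)
}
```

Downstream jobs then only need to understand `UserV2`, whether the data came from old or new datasets.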
Language choice, people & preferences
● Java most popular in engineering
+ Good engineering ecosystem
- Boilerplate bad for prototyping & enthusiasm
● Python popular among data scientists
+ Great AI libraries, quick prototypes
- Dynamic typing
● SQL popular among analysts
○ Error handling & data quality difficult
○ Does not compose and scale
● Scala connects both worlds
○ Home of data engineering innovation during the big data hype
24
● Static typing
● Error handling
● Monitor, debug, profile ecosystem
● Rapid prototyping
● No rituals & boilerplate
● Science innovation
Scalameta
● Parsing and analysis of Scala source code
25
val a = b() + 3
lex →
["val", " ", "a", " ", "=", " ", "b", "(", ")", " ", "+", " ", "3"]
parse →
[val, "a", =, Call("b"), +, Int(3)]
semantic analysis →
[val, Int(a), =, Call(com.scling.func.b), +, Int(3)]
Scalameta use cases
● Scalafmt
● Scalafix
○ Static analysis
○ Code transformation
● Online code generation - macros
● Offline code generation
27
// Example from scio 0.7 -> 0.8 upgrade rules
import scalafix.v1._
import scala.meta._

final class FixTensorflow extends SemanticRule("FixTensorflow") {
  override def fix(implicit doc: SemanticDocument): Patch =
    doc.tree.collect {
      // Rewrite x.saveAsTfExampleFile to x.saveAsTfRecordFile
      case t @ Term.Select(s, Term.Name("saveAsTfExampleFile")) =>
        Patch.replaceTree(t, q"$s.saveAsTfRecordFile".syntax)
    }.asPatch
}
Schema & syntax tree
28
Defn.Class(
List(Mod.Annot(Init(Type.Name("PrivacyShielded"), , List())), case),
Type.Name("SaleTransaction"),
List(),
Ctor.Primary(
List(),
,
List(
List(
Term.Param(
List(Mod.Annot(Init(Type.Name("PersonalId"), , List()))),
Term.Name("customerClubId"),
Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
None
),
Term.Param(
List(Mod.Annot(Init(Type.Name("PersonalData"), , List()))),
Term.Name("storeId"),
Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
None
),
Term.Param(
List(),
Term.Name("item"),
Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
None
),
Term.Param(List(), Term.Name("timestamp"), Some(Type.Name("String")), None)
)
)
),
Template(List(), List(), Self(, None), List()))
@PrivacyShielded
case class SaleTransaction(
@PersonalId customerClubId: Option[String],
@PersonalData storeId: Option[String],
item: Option[String],
timestamp: String
)
Quasiquotes
29
val stat: Stat = "val a = b() + 3".parse[Stat].get
val stat: Stat = q"val a = b() + 3"
Quasiquotes matching
30
val stat: Stat = "val a = b() + 3".parse[Stat].get
val stat: Stat = q"val a = b() + 3"
stat match {
  case q"val $name: $typeOpt = $expr" =>
    println(s"Got val declaration $name of type " +
      s"${typeOpt.structure} : ${expr.structure}")
}
● Expressive
● Custom types
● IDE support
● Avro for data lake storage
Schema definition choice
31
● RDBMS: Table metadata
● Avro: JSON/DSL definition
○ Definition is bundled with avro data files
● Parquet
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
case class User(id: Int, name: String, age: Int,
                phone: Option[String] = None)
val users = Seq(User(1, "Alice", 34),
                User(2, "Bob", 42, Some("08-123456")))
Schema offspring
32
Generated from case classes:
● Test record difference render type classes
● Test equality type classes
● Avro definitions
● Java Avro codec classes
● Java <-> Scala converters
● Avro type annotations
● MySQL schemas
● CSV codecs
● Privacy by design machinery
● Python
● Logical types
Avro codecs
33
Generated from case classes: Avro definitions, Java Avro codec classes, Java <-> Scala converters
{
  "type": "record",
  "name": "JavaUser",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone", "type": [ "null", "string" ], "default": null }
  ]
}
public class JavaUser implements SpecificRecord {
public Integer getAge() { ... }
public String getPhone() { ... }
}
object UserConverter extends AvroConverter[User] {
def fromSpecific(u: JavaUser): User
def toSpecific(u: User): JavaUser
}
case class User(age: Int,
phone: Option[String] = None)
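The first step of this chain, case class to Avro definition, can be sketched with Scalameta. This assumes Scala 2.13 with the `org.scalameta::scalameta` library on the classpath (quasiquotes are Scala 2 macros). The `avroType` mapping and the `avroSchema` function are hypothetical and cover only the types in this example; this is not the generator used in the talk.

```scala
import scala.meta._

object AvroFromCaseClassSketch {
  // Hypothetical mapping from Scala types to Avro type JSON,
  // limited to the handful of types used on these slides.
  def avroType(tpe: Type): String = tpe match {
    case t"Int"            => "\"int\""
    case t"Long"           => "\"long\""
    case t"String"         => "\"string\""
    case t"Option[$inner]" => s"""[ "null", ${avroType(inner)} ]"""
    case other             => sys.error(s"Unsupported type: ${other.syntax}")
  }

  // Parse the source of one case class and render an Avro record definition.
  def avroSchema(source: String): String = {
    val tree = source.parse[Stat].get
    val className = tree.collect { case c: Defn.Class => c.name.value }.head
    val fields = tree.collect {
      case p: Term.Param =>
        val tpe = p.decltpe.getOrElse(sys.error(s"Missing type on ${p.name.value}"))
        s"""{ "name": "${p.name.value}", "type": ${avroType(tpe)} }"""
    }
    s"""{ "type": "record", "name": "$className", "fields": [ ${fields.mkString(", ")} ] }"""
  }
}
```

Calling `avroSchema("case class User(age: Int, phone: Option[String] = None)")` yields a record named `User` with an `int` field and a nullable `string` field. Defaults and annotations are ignored in this sketch.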
Quasiquotes in practice
34
q"""
  object $converterName extends AvroConverter[${srcClass.clazz.name}] {
    import RecordFieldConverters._
    type S = $jClassName
    def schema: Schema = $javaClassTerm.getClassSchema()
    def tag: ClassTag[S] = implicitly[ClassTag[S]]
    def datumReader: SpecificDatumReader[S] = new SpecificDatumReader[$jClassName](classOf[$jClassName])
    def datumWriter: SpecificDatumWriter[S] = new SpecificDatumWriter[$jClassName](classOf[$jClassName])
    def fromSpecific(record: $jClassName): ${srcClass.clazz.name} =
      ${Term.Name(srcClass.clazz.name.value)}(..$fromInits)
    def toSpecific(record: ${srcClass.clazz.name}): $jClassName =
      new $jClassName(..$specificArgs)
  }
"""
CSV codecs
35
case classes
CSV codecs
import kantan.csv._
object CsvCodecs {
implicit val userDecoder: HeaderDecoder[User] = ...
}
case class User(age: Int,
phone: Option[String] = None)
Test equality
36
Generated from case classes: test equality type classes, test record difference render type classes
trait REquality[T] { def equal(left: T, right: T): Boolean }
object REquality {
  implicit val double: REquality[Double] = new REquality[Double] {
    def equal(left: Double, right: Double): Boolean = {
      // Use a combination of absolute and relative tolerance.
      // `===` and `+-` are assumed to come from Scalactic / ScalaTest.
      left === right +- 1e-5.max(left.abs * 1e-5).max(right.abs * 1e-5)
    }
  }
  /** binds the Magnolia macro to the `gen` method */
  implicit def gen[T]: REquality[T] = macro Magnolia.gen[T]
}
object Equalities {
  implicit val equalityUser: REquality[User] =
    REquality.gen[User]
}
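The tolerance rule can be tried without Scalactic or Magnolia. The sketch below restates the same combination of absolute and relative tolerance in plain Scala and writes one case class instance by hand, where the slide derives instances with a macro. `Measurement`, `same` and the instance names are hypothetical.

```scala
object ToleranceSketch {
  trait REquality[T] { def equal(left: T, right: T): Boolean }

  implicit val doubleEquality: REquality[Double] = new REquality[Double] {
    def equal(left: Double, right: Double): Boolean = {
      // Absolute tolerance 1e-5, or relative tolerance 1e-5 of the larger
      // magnitude, whichever is greater.
      val tolerance = math.max(1e-5, math.max(left.abs, right.abs) * 1e-5)
      (left - right).abs <= tolerance
    }
  }

  // Hand-written instance for one case class; a derivation macro would
  // generate the equivalent field-by-field comparison.
  final case class Measurement(sensor: String, value: Double)

  implicit val measurementEquality: REquality[Measurement] = new REquality[Measurement] {
    def equal(left: Measurement, right: Measurement): Boolean =
      left.sensor == right.sensor && doubleEquality.equal(left.value, right.value)
  }

  def same[T](left: T, right: T)(implicit equality: REquality[T]): Boolean =
    equality.equal(left, right)
}
```

With this, `same(0.1 + 0.2, 0.3)` holds even though `0.1 + 0.2 == 0.3` does not, which is the point of using a dedicated equality in pipeline tests.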
Python + RDBMS
37
Generated from case classes: Avro definitions with Avro type annotations, MySQL schemas, Python
case class User(
  age: Int,
  @AvroProp("sqlType", "varchar(1012)")
  phone: Option[String] = None)
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone",
      "type": [ "null", "string" ],
      "sqlType": "varchar(1012)" }
  ]
}
class UserEgressJob(CopyToTable):
    columns = [
        ("age", "int"),
        ("phone", "varchar(1012)"),
    ]
    ...
Logical types
38
Generated from case classes: logical types
case t"""Instant""" =>
  JObject(List(JField("type", JString("long")),
    JField("logicalType", JString("timestamp-micros"))))
case t"""LocalDate""" =>
  JObject(List(JField("type", JString("int")),
    JField("logicalType", JString("date"))))
case t"""YearMonth""" => JObject(List(JField("type", JString("int"))))
case t"""JObject""" => JString("string")
● Avro logical types
○ E.g. date → int, timestamp → long
○ Default is timestamp-millis
■ Great for year > 294247, beyond the range of timestamp-micros (!)
● Custom logical types
○ Time
○ Collections
○ Physical
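The range remark can be checked with `java.time`: a signed 64-bit count of microseconds since the epoch runs out around year 294 thousand, while milliseconds last about a thousand times longer. A small sketch, with `TimestampRangeSketch` as a hypothetical name:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.temporal.ChronoUnit

object TimestampRangeSketch {
  private def yearOf(instant: Instant): Int =
    instant.atOffset(ZoneOffset.UTC).getYear

  // Last year representable as a long count of microseconds since 1970.
  val maxMicrosYear: Int =
    yearOf(Instant.EPOCH.plus(Long.MaxValue, ChronoUnit.MICROS)) // 294247

  // Last year representable as a long count of milliseconds since 1970.
  val maxMillisYear: Int =
    yearOf(Instant.ofEpochMilli(Long.MaxValue)) // 292278994
}
```

For business data, microsecond precision therefore costs nothing in practical range.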
Stretching the type system
39
● Fail: mixup kW and kWh
● Could be a compile-time error. Should be.
● Physical dimension libraries
○ Boost.Units - C++
○ Coulomb - Scala
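Even without a dimension library, distinct wrapper types turn the kW / kWh mixup into a compile-time error. A minimal sketch using value classes; all names here are hypothetical, and a real units library offers far more (conversions, arbitrary dimensions).

```scala
object UnitsSketch {
  final case class Hours(value: Double) extends AnyVal

  // Power. Multiplying by a duration yields energy.
  final case class Kilowatt(value: Double) extends AnyVal {
    def *(duration: Hours): KilowattHour = KilowattHour(value * duration.value)
  }

  // Energy. Only energy can be added to energy.
  final case class KilowattHour(value: Double) extends AnyVal {
    def +(other: KilowattHour): KilowattHour = KilowattHour(value + other.value)
  }

  def bill(energy: KilowattHour, pricePerKwh: Double): Double =
    energy.value * pricePerKwh

  // bill(Kilowatt(2.0), 0.5)  // does not compile: Kilowatt is not KilowattHour
}
```

Usage: `bill(Kilowatt(2.0) * Hours(3.0), 0.5)` type-checks, while passing a `Kilowatt` directly is rejected by the compiler.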
Data
lake
Private
pond
Cold
store
Ingest prepared for deletion
40
Mutation
Landing
pond
Append +
delete
Immutable,
limited
retention
Lost key pattern
41
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Single dataset leak → no PII leak
+ Handles transformed PII fields
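The pattern can be sketched with the JDK's own crypto classes. Below, an in-memory map stands in for the per-user key table, and AES-GCM is an assumed cipher choice; the talk does not say which algorithm or storage is used. All names are hypothetical.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator, SecretKey}
import javax.crypto.spec.GCMParameterSpec
import scala.collection.mutable

object LostKeySketch {
  private val random = new SecureRandom()

  // Hypothetical stand-in for the per-user decryption key table.
  private val keyTable = mutable.Map.empty[String, SecretKey]

  private def keyFor(userId: String): SecretKey =
    keyTable.getOrElseUpdate(userId, {
      val generator = KeyGenerator.getInstance("AES")
      generator.init(128)
      generator.generateKey()
    })

  final case class Encrypted(iv: Array[Byte], bytes: Array[Byte])

  // Encrypt one PII field value with the user's key.
  def encrypt(userId: String, plain: String): Encrypted = {
    val iv = new Array[Byte](12)
    random.nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, keyFor(userId), new GCMParameterSpec(128, iv))
    Encrypted(iv, cipher.doFinal(plain.getBytes(UTF_8)))
  }

  // The extra join + decrypt: returns None once the key has been cleared.
  def decrypt(userId: String, encrypted: Encrypted): Option[String] =
    keyTable.get(userId).map { key =>
      val cipher = Cipher.getInstance("AES/GCM/NoPadding")
      cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, encrypted.iv))
      new String(cipher.doFinal(encrypted.bytes), UTF_8)
    }

  // Clearing the single user key puts every copy of the data into oblivion.
  def forget(userId: String): Unit = {
    keyTable.remove(userId)
    ()
  }
}
```

After `forget("u1")`, every dataset that holds fields encrypted for `u1` becomes unreadable for that user without touching the datasets themselves.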
Shieldformation
42
@PrivacyShielded
case class Sale(
@PersonalId customerClubId: Option[String],
@PersonalData storeId: Option[String],
item: Option[String],
timestamp: String
)
case class SaleShielded(
shieldId: Option[String],
customerClubIdEncrypted: Option[String],
storeIdEncrypted: Option[String],
item: Option[String],
timestamp: String
)
case class SaleAnonymous(
item: Option[String],
timestamp: String
)
object SaleAnonymize extends SparkJob {
...
}
ShieldForm
object SaleExpose extends SparkJob {
...
}
object SaleShield extends SparkJob {
...
}
case class Shield(
shieldId: String,
personId: Option[String],
keyStr: Option[String],
encounterDate: String
)
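To make the relation between these classes concrete, here is a hand-written sketch of the record-level transformations that the generated jobs could apply. It is an illustration under assumptions, not Shieldformation output: the `encrypt` function is passed in as a placeholder, and the rule that `shieldId` is only set when a personal id is present is a guess.

```scala
object ShieldformationSketch {
  final case class Sale(
    customerClubId: Option[String], // @PersonalId in the annotated original
    storeId: Option[String],        // @PersonalData in the annotated original
    item: Option[String],
    timestamp: String)

  final case class SaleShielded(
    shieldId: Option[String],
    customerClubIdEncrypted: Option[String],
    storeIdEncrypted: Option[String],
    item: Option[String],
    timestamp: String)

  final case class SaleAnonymous(item: Option[String], timestamp: String)

  // Annotated fields are encrypted; the rest is copied as-is.
  def shield(sale: Sale, shieldId: String, encrypt: String => String): SaleShielded =
    SaleShielded(
      shieldId = sale.customerClubId.map(_ => shieldId),
      customerClubIdEncrypted = sale.customerClubId.map(encrypt),
      storeIdEncrypted = sale.storeId.map(encrypt),
      item = sale.item,
      timestamp = sale.timestamp)

  // Annotated fields are dropped entirely.
  def anonymize(sale: Sale): SaleAnonymous =
    SaleAnonymous(sale.item, sale.timestamp)
}
```

The value of generating these from the annotated case class is that adding a field to `Sale` updates all three shapes and jobs consistently.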
Shield
Shieldformation & lost key
43
SaleShield
Sale
Sale
Shielded
Shield
Deletion
requests
Customer
History
Exposed egress
SaleExpose
Limited retention
SaleAnonymize
Sale
Anonymous
Sale
Stats
Access limitation models
44
Scramble at ingress (naive)
Validate & filter at access (enterprise)
Locked lake (engineered)
Federated locking (scalable)
Shield
Shieldformation & locked lake
45
SaleShield
Sale
Sale
Shielded
Shield
Deletion
requests
Customer
History
Exposed egress
SaleExpose
Limited retention
SaleAnonymize
Sale
Anonymous
Sale
Stats
Owned by
target system
Schema on write!
46
DB
DB
DB
Service
Service
Export
Business
intelligence
Change agility important here
Production stability important here
Success in data factories vs data trends
48
● Work driven by use cases
○ Teams aligned along the value chain
● Minimal data innovation friction
○ Data democratised - accessible and usable
○ Quick code / test / debug / deploy feedback cycle
● High value / non-value work ratio
○ Guard rails to maintain speed without risk
■ Dev tooling, tests, quality metrics
○ Minimal operational toil
● Build on software engineering processes
○ Composability
○ DevOps, everything as code
○ Strong CI/CD process
Data trends: data mesh, data contracts, no code / low code, SQL / data warehouses
Data factory track record
49
Industry           Time to first flow  Staff size  1st flow effort (weeks)  1st flow cost (w * 50K ?)  Time to innovation  Flows 1y after first
Media              1+ years            10-30       1500?                    100M (0.5-1B)              1+ year             ?
Finance            2 years             10-50       2000?                    100M?                      Years               10?
Media              3 weeks             4.5 - 8     15                       750K                       3 months            30
Retail             7 weeks             1-3         7                        500K *                     6 months            70
Telecom            12 weeks            2-5         30                       1500K                      6 months            50
Consumer products  20+ weeks           1.5         30+                      1200+K                     6+ months           20
Construction       8 weeks             0.5         4                        150K *                     7 months            10
Manufacturing      8 weeks             0.5         4                        200K *                     6 months            ?
Q & A?
50
● Is Shieldformation open source?
○ No. It might be when it is older and less volatile, and we have grown enough to maintain it properly.
● Is cryptoshredding really acceptable as deletion?
○ Yes.
● Is the lost key pattern legally sufficient for all use cases?
○ No.
○ It does not provide complete anonymisation, but pseudonymisation with a limited time span (1 month).
○ Unless data is very sensitive, it has been deemed legally sufficient.
○ Be careful with health & geo data.

 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 
Privacy by design
Privacy by designPrivacy by design
Privacy by design
 

Recently uploaded

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 

Recently uploaded (20)

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 

Schema management with Scalameta

  • 1. www.scling.com Schema management with Scalameta Func prog Sweden Meetup, 2023-10-10 Lars Albertsson Scling 1
  • 2. www.scling.com Who’s talking? ... Google - video conference, productivity ... Spotify - data engineering ... Independent data engineering consultant Banks, media, startups, heavy industry, telco Founder @ Scling - data-factory-as-a-service 2
  • 3. www.scling.com 100x 100x Data artifacts produced Manual, mechanised, industrialised 3 Spotify's pipelines ~2013
  • 4. www.scling.com Craft vs industry 4 ● Each step steered by human ○ Or primitive automation ● Improving artifacts ● Craft is primary competence ● Components made for humans ○ Look nice, "easy to use" ○ Most popular ● Autonomous processes ● Improving process that creates artifacts ● Multitude of competences ● Some components unusable by humans ○ Hard, greasy ○ Made for integration ○ Less popular ● Processes often include craft steps
  • 5. www.scling.com Road towards industrialisation 5 Data warehouse age - mechanised analytics DW LAMP stack age - manual analytics Hadoop age - industrialised analytics, data-fed features, machine learning Significant change in workflows Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations
  • 6. www.scling.com Simplifying use of new technology 6 DW Enterprise big data failures "Modern data stack" - traditional workflows, new technology Low-code, no-code
  • 7. www.scling.com We have seen this before 7 Difficult adoption 4GL, UML, low-code, no-code Software engineering education
  • 8. www.scling.com Data engineering in the future 8 DW ~10 year capability gap "data factory engineering" Enterprise big data failures "Modern data stack" - traditional workflows, new technology 4GL / UML phase of data engineering Data engineering education
  • 9. www.scling.com Value of data factories ● Factory value ~ robot movements. ● Data factory value ~ number of datasets. ● Differences are orders of magnitude ○ Enterprises: 100s / day ○ Spotify: 100Ks / day ○ Google: Bs / day ● Cost of DataOps - value of data ○ Low cost - further out the long tail 9 Disruptive value of data, ML Traditional data warehouse reporting
  • 10. www.scling.com Data-factory-as-a-service 10 Data lake ● Data factory ○ Collected, raw data → processed, valuable data ● Data pipelines customised for client ○ Analytics (BI, reports, A/B testing) ○ Data-fed features (autocomplete, search) ○ Learning systems (recommendations, fraud) ● Compete with data leaders: ○ Quick idea-to-production ○ Operational efficiency {....} {....} {....}
  • 11. www.scling.com Success in data factories 11 ● Work driven by use cases ○ Teams aligned along the value chain ● Minimal data innovation friction ○ Data democratised - accessible and usable ○ Quick code / test / debug / deploy feedback cycle ● High value / non-value work ratio ○ Guard rails to maintain speed without risk ■ Dev tooling, tests, quality metrics ○ Minimal operational toil ● Build on software engineering processes ○ Composability ○ DevOps, everything as code ○ Strong CI/CD process
  • 13. www.scling.com Data lake Transformation Cold pond 13 Mutation Immutable, shareable Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations DataOps, functional workflows: ● Immutable, shared data ● Resilient to failure ● Quick error recovery ● Low-risk experiments Data factories
  • 14. www.scling.com ● Lowest common denominator = name, type, required ○ Types: string, long, double, binary, array, map, union, record ● Schema specification may support additional constraints, e.g. integer range, other collections What is a schema? 14 Id Name Age Phone 1 "Anna" 34 null 2 "Bob" 42 "08-123456" Fields Name Type Required? In RDBMS, relations are explicit In lake/stream datasets, relations are implicit
  • 15. www.scling.com Schema definitions 15 { "type" : "record", "namespace" : "com.mapflat.example", "name" : "User", "fields" : [ { "name" : "id" , "type" : "int" }, { "name" : "name" , "type" : "string" }, { "name" : "age" , "type" : "int" }, { "name" : "phone" , "type" : ["null", "string"], "default": null } ] } ● RDBMS: Table metadata ● Avro: JSON/DSL definition ○ Definition is bundled with avro data files ● Parquet ● pyschema / dataclass ● Scala case classes ● JSON-schema ● JSON: Each record ○ One record insufficient to deduce schema { "id": 1, "name": "Alice", "age": "34" } { "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" } case class User(id: String, name: String, age: Int, phone: Option[String] = None) val users = Seq( User("1", "Alice", 32), User("2", "Bob", 43, Some("08-123456")))
  • 16. www.scling.com Avro serialisation 16 { "type" : "record", "namespace" : "com.mapflat.example", "name" : "User", "fields" : [ { "name" : "Id" , "type" : "int" }, { "name" : "Name" , "type" : "string" }, { "name" : "Age" , "type" : "int" }, { "name" : "Phone" , "type" : ["null", "string"], "default": null } ] } { "Id": 1, "Name": "Alice", "Age": "34" } { "Id": 1, "Name": "Bob", "Age": "42", "Phone": "08-123456" } User("1", "Alice", 32) User("2", "Bob", 43, Some("08-123456")) Write JSON Write binary 1001001110101010010101010010111010101010100100111101010 1010101001010101011010101010110111010101010100110010101 Read JSON Read binary requires schema User("1", "Alice", 32) User("2", "Bob", 43, Some("08-123456"))
  • 17. www.scling.com Avro in files 17 { "type" : "record", "namespace" : "com.mapflat.example", "name" : "User", "fields" : [ { "name" : "Id" , "type" : "int" }, { "name" : "Name" , "type" : "string" }, { "name" : "Age" , "type" : "int" }, { "name" : "Phone" , "type" : ["null", "string"], "default": null } ] } { "Id": 1, "Name": "Alice", "Age": "34" } { "Id": 1, "Name": "Bob", "Age": "42", "Phone": "08-123456" } (JSON, line separated) Schema header + binary records 1001001110101010010101010010111010101010100100111101010 1010101001010101011010101010110111010101010100110010101 User("1", "Alice", 32) User("2", "Bob", 43, Some("08-123456"))
  • 18. www.scling.com Schema on write 18 ● Schema defined by writer ● Destination (table / dataset / stream topic) has defined schema ○ Technical definition with metadata (e.g. RDBMS, Kafka + registry) ○ By convention ● Writes not in compliance are not accepted ○ Technically aborted (e.g. RDBMS) ○ In violation of intent (e.g. HDFS datasets) ● Can be technically enforced by producer driver ○ Through ORM / code generation ○ Schema registry lookup Strict checking philosophy
  • 19. www.scling.com Schema on read 19 ● Anything (technically) accepted when writing ● Schema defined by reader, at consumption ○ When joining or filtering, unknown fields go through ● Reader may impose requirements on type & value ● Violations of constraints are detected at read ○ Perhaps long after production? ○ By team not owning producer? Loose checking philosophy hdfs dfs -cat part-00000.json | jq -c '. | select(.country == "SE")'
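The schema-on-read idea on this slide, where the reader imposes type requirements at consumption and unknown fields are tolerated, can be sketched in plain Scala. This is only an illustration: `SchemaOnRead`, `Raw`, `User` and `read` are hypothetical names, and a `Map[String, Any]` stands in for whatever a JSON parser would return.

```scala
object SchemaOnRead {
  // The shape this particular reader requires. Hypothetical, for illustration.
  final case class User(id: Int, name: String, country: String)

  // A loosely typed record, standing in for parsed JSON.
  type Raw = Map[String, Any]

  // Violations are detected here, at read time, not when the record was written.
  // Fields the reader does not know about are simply ignored.
  def read(raw: Raw): Either[String, User] =
    for {
      id      <- raw.get("id").collect { case i: Int => i }.toRight("id missing or not an int")
      name    <- raw.get("name").collect { case s: String => s }.toRight("name missing or not a string")
      country <- raw.get("country").collect { case s: String => s }.toRight("country missing or not a string")
    } yield User(id, name, country)
}
```

A record written with `"id": "1"` is accepted by storage and only fails when this reader consumes it, possibly long after it was produced.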
  • 20. www.scling.com Dynamic vs static typing 20 Schema on write Schema on read Static typing Dynamic typing Strict Loose Possible Java: user.setName("Alice"); user2.getName(); Scala: user = User(name = "Alice", ...) user2.name Java: user.set("name", "Alice"); user2.get("name"); Python: user.name = "Alice" user2.name
  • 21. www.scling.com Schema on read or write? 21 DB DB DB Service Service Export Business intelligence Change agility important here Production stability important here
  • 22. www.scling.com Backward: E.g. add optional field Compatibility 22 New consumer Forward: E.g. remove optional field Old producer = old datasets Old consumer New producer
  • 23. www.scling.com Incompatible changes 23 ● Jobs can accept old input formats ● Translate to new format internally ● Or rerun computations from raw data ● Dedicated job to translate old to new format - upcasting.
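A dedicated upcasting step, as mentioned on this slide, could look roughly like the sketch below. The record versions `UserV1` and `UserV2` and the function `upcast` are assumptions made for illustration; they are not taken from the talk.

```scala
object Upcasting {
  // Old format (hypothetical): age stored as a string, no phone field.
  final case class UserV1(id: String, name: String, age: String)

  // New format (hypothetical): numeric age, optional phone.
  final case class UserV2(id: String, name: String, age: Int, phone: Option[String] = None)

  // Translate an old record to the new format. Returns Left with a reason
  // when the old value cannot be represented in the new schema.
  def upcast(old: UserV1): Either[String, UserV2] =
    old.age.trim.toIntOption match {
      case Some(age) => Right(UserV2(old.id, old.name, age))
      case None      => Left(s"User ${old.id}: age '${old.age}' is not an integer")
    }
}
```

Returning an `Either` keeps untranslatable records visible instead of silently dropping them, so the job can count or divert them.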
  • 24. www.scling.com Language choice, people & preferences ● Java most popular in engineering + Good engineering ecosystem - Boilerplate bad for prototyping & enthusiasm ● Python popular among data scientists + Great AI libraries, quick prototypes - Dynamic typing ● SQL popular among analysts ○ Error handling & data quality difficult ○ Does not compose and scale ● Scala connects both worlds ○ Home of data engineering innovation during the big data hype 24 ● Static typing ● Error handling ● Monitor, debug, profile ecosystem ● Rapid prototyping ● No rituals & boilerplate ● Science innovation
  • 25. www.scling.com Scalameta ● Parsing and analysis of scala source code 25 val a = b() + 3 ["val", " ", "a", " ", "=", " ", "b", "(", ")", " ", "+", " ", "3"] [val, "a", =, Call("b"), +, Int(3)] [val, Int(a), =, Call(com.scling.func.b), +, Int(3)] lex parse semantic analysis
  • 26. www.scling.com Scalameta use cases ● Scalafmt ● Scalafix ○ Static analysis ○ Code transformation ● Online code generation - macros ● Offline code generation 26
  • 27. www.scling.com Scalameta use cases ● Scalafmt ● Scalafix ○ Static analysis ○ Code transformation ● Online code generation - macros ● Offline code generation 27 // Example from scio 0.7 -> 0.8 upgrade rules final class FixTensorflow extends SemanticRule("FixTensorflow") { override def fix(implicit doc: SemanticDocument): Patch = doc.tree.collect { case t @ Term.Select(s, Term.Name( "saveAsTfExampleFile")) => Patch.replaceTree(t, q"$s.saveAsTfRecordFile".syntax) }.asPatch }
  • 28. www.scling.com Schema & syntax tree 28 Defn.Class( List(Mod.Annot(Init(Type.Name("PrivacyShielded"), , List())), case), Type.Name("SaleTransaction"), List(), Ctor.Primary( List(), , List( List( Term.Param( List(Mod.Annot(Init(Type.Name("PersonalId"), , List()))), Term.Name("customerClubId"), Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))), None ), Term.Param( List(Mod.Annot(Init(Type.Name("PersonalData"), , List()))), Term.Name("storeId"), Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))), None ), Term.Param( List(), Term.Name("item"), Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))), None ), Term.Param(List(), Term.Name("timestamp"), Some(Type.Name("String")), None) ) ) ), Template(List(), List(), Self(, None), List())) @PrivacyShielded case class SaleTransaction( @PersonalId customerClubId: Option[String], @PersonalData storeId: Option[String], item: Option[String], timestamp: String )
  • 29. www.scling.com Quasiquotes 29 val stat: Stat = "val a = b() + 3".parse[Stat].get val stat: Stat = q"val a = b() + 3"
  • 30. www.scling.com Quasiquotes matching 30 val stat: Stat = "val a = b() + 3".parse[Stat].get val stat: Stat = q"val a = b() + 3" stat match { case q"val $name: $typeOpt = $expr" => println(s"Got val declaration $name of type" + s"${typeOpt.structure} : ${expr.structure} ") }
  • 31. www.scling.com ● Expressive ● Custom types ● IDE support ● Avro for data lake storage Schema definition choice 31 ● RDBMS: Table metadata ● Avro: JSON/DSL definition ○ Definition is bundled with avro data files ● Parquet ● pyschema / dataclass ● Scala case classes ● JSON-schema ● JSON: Each record ○ One record insufficient to deduce schema case class User(id: String, name: String, age: Int, phone: Option[String] = None) val users = Seq( User("1", "Alice", 32), User("2", "Bob", 43, Some("08-123456")))
  • 32. www.scling.com Schema offspring Test record difference render type classes 32 case classes test equality type classes Avro definitions Java Avro codec classes Java <-> Scala converters Avro type annotations MySQL schemas CSV codecs Privacy by design machinery Python Logical types
  • 33. www.scling.com Avro codecs 33 case classes Avro definitions Java Avro codec classes Java <-> Scala converters { "name": "JavaUser", { "name": "age", "type": "int" } { "name": "phone", "type": [ "null", "string" ] } } public class JavaUser implements SpecificRecord { public Integer getAge() { ... } public String getPhone() { ... } } object UserConverter extends AvroConverter[User] { def fromSpecific(u: JavaUser): User def toSpecific(u: User): JavaUser } case class User(age: Int, phone: Option[String] = None)
  • 34. www.scling.com Quasiquotes in practice 34 q""" object $converterName extends AvroConverter[${srcClass.clazz.name} ] { import RecordFieldConverters._ type S = $jClassName def schema: Schema = $javaClassTerm.getClassSchema() def tag: ClassTag[S] = implicitly[ClassTag[S]] def datumReader: SpecificDatumReader[S] = new SpecificDatumReader[$jClassName](classOf[$jClassName]) def datumWriter: SpecificDatumWriter[S] = new SpecificDatumWriter[$jClassName](classOf[$jClassName]) def fromSpecific(record: $jClassName): ${srcClass.clazz.name} = ${Term.Name(srcClass.clazz.name.value)} (..$fromInits ) def toSpecific(record: ${srcClass.clazz.name} ): $jClassName = new $jClassName(..$specificArgs) } """
  • 35. www.scling.com CSV codecs 35 case classes CSV codecs import kantan.csv._ object CsvCodecs { implicit val userDecoder: HeaderDecoder[User] = ... } case class User(age: Int, phone: Option[String] = None)
  • 36. www.scling.com Test equality Test record difference render type classes 36 case classes test equality type classes trait REquality[ T] { def equal(value: T, right: T): Boolean } object REquality { implicit val double: REquality[Double] = new REquality[Double] { def equal(left: Double, right: Double): Boolean = { // Use a combination of absolute and relative tolerance left === right +- 1e-5.max(left.abs * 1e-5).max(right.abs * 1e-5) } } /** binds the Magnolia macro to the `gen` method */ implicit def gen[T]: REquality[ T] = macro Magnolia. gen[T] } object Equalities { implicit val equalityUser: REquality[User] = REquality. gen[User] }
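The tolerance rule used for doubles on this slide can be tried without ScalaTest or Magnolia. The sketch below assumes that `left === right +- t` means "absolute difference at most `t`"; `ApproxEquality` and `approxEqual` are hypothetical names.

```scala
object ApproxEquality {
  // Combined absolute and relative tolerance, mirroring the slide's formula:
  // max(1e-5, |left| * 1e-5, |right| * 1e-5).
  def approxEqual(left: Double, right: Double): Boolean = {
    val tolerance = math.max(1e-5, math.max(left.abs * 1e-5, right.abs * 1e-5))
    (left - right).abs <= tolerance
  }
}
```

The absolute part handles values near zero, while the relative part keeps large values from failing on rounding noise.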
  • 37. www.scling.com case class User( age: Int, @AvroProp ("sqlType", "varchar(1012)") phone: Option[String] = None) Python + RDBMS 37 case classes Avro definitions Avro type annotations MySQL schemas Python { "name": "User", { "name": "age", "type": "int" } { "name": "phone", "type": [ "null", "string" ], "sqlType": "varchar(1012)", } } class UserEgressJob(CopyToTable): columns = [ ( "age", "int"), ( "name", "varchar(1012)"), ] ...
  • 38. www.scling.com Logical types 38 case classes Logical types case t"""Instant""" => JObject(List(JField("type", JString("long")), JField("logicalType", JString("timestamp-micros")))) case t"""LocalDate""" => JObject(List(JField("type", JString("int")), JField("logicalType", JString("date")))) case t"""YearMonth""" => JObject(List(JField("type", JString("int")))) case t"""JObject""" => JString("string") ● Avro logical types ○ E.g. date → int, timestamp → long ○ Default is timestamp-millis ■ Great for year > 294441 (!) ● Custom logical types ○ Time ○ Collections ○ Physical
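The mapping behind the `date` and `timestamp-micros` logical types can be illustrated with `java.time` alone. In the Avro specification, `date` is an int counting days since 1970-01-01 and `timestamp-micros` is a long counting microseconds since the epoch. The object and function names below are hypothetical helpers, not part of any Avro library.

```scala
import java.time.{Instant, LocalDate}
import java.time.temporal.ChronoUnit

object LogicalTypes {
  // timestamp-micros: microseconds since 1970-01-01T00:00:00Z, stored as a long.
  def toTimestampMicros(i: Instant): Long =
    ChronoUnit.MICROS.between(Instant.EPOCH, i)

  def fromTimestampMicros(micros: Long): Instant =
    Instant.EPOCH.plus(micros, ChronoUnit.MICROS)

  // date: days since 1970-01-01, stored as an int.
  // toIntExact throws rather than silently overflowing.
  def toDate(d: LocalDate): Int =
    Math.toIntExact(d.toEpochDay)

  def fromDate(days: Int): LocalDate =
    LocalDate.ofEpochDay(days.toLong)
}
```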
  • 39. www.scling.com Stretching the type system 39 ● Fail: mixup kW and kWh ● Could be a compile-time error. Should be. ● Physical dimension libraries ○ Boost.Units - C++ ○ Coulomb - Scala
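Even without a physical dimension library, the kW versus kWh mixup can be turned into a compile-time error with plain value classes. This is a minimal sketch with hypothetical types; a dedicated library would handle unit conversion and arbitrary dimensions far more generally.

```scala
object Units {
  final case class Hours(value: Double) extends AnyVal

  // Power. Multiplying by a duration yields energy.
  final case class KiloWatt(value: Double) extends AnyVal {
    def +(other: KiloWatt): KiloWatt = KiloWatt(value + other.value)
    def *(h: Hours): KiloWattHour    = KiloWattHour(value * h.value)
  }

  // Energy. Can only be added to other energy values.
  final case class KiloWattHour(value: Double) extends AnyVal {
    def +(other: KiloWattHour): KiloWattHour = KiloWattHour(value + other.value)
  }
}
```

With these types, `Units.KiloWatt(2.0) + Units.KiloWattHour(6.0)` does not compile, while `Units.KiloWatt(2.0) * Units.Hours(3.0)` produces a `KiloWattHour`.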
  • 40. www.scling.com Data lake Private pond Cold store Ingest prepared for deletion 40 Mutation Landing pond Append + delete Immutable, limited retention
  • 41. www.scling.com ● PII fields encrypted ● Per-user decryption key table ● Clear single user key => oblivion - Extra join + decrypt - Decryption (user) id needed + Multi-field oblivion + Single dataset leak → no PII leak + Handles transformed PII fields Lost key pattern 41
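The lost key pattern can be sketched with the JDK's `javax.crypto` API: each user gets an own key, PII fields are stored encrypted, and deleting the key makes every copy of the ciphertext unreadable. The in-memory key table, the AES-GCM choice and all names below are assumptions for illustration, not a description of how Scling implements it.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator, SecretKey}
import javax.crypto.spec.GCMParameterSpec
import scala.collection.mutable

object LostKey {
  private val random = new SecureRandom()

  // Per-user key table. An in-memory map stands in for a mutable key dataset.
  private val keys = mutable.Map.empty[String, SecretKey]

  private def keyFor(userId: String): SecretKey =
    keys.getOrElseUpdate(userId, {
      val gen = KeyGenerator.getInstance("AES")
      gen.init(128)
      gen.generateKey()
    })

  final case class Encrypted(iv: Array[Byte], bytes: Array[Byte])

  // Encrypt one PII field with the user's key, using a fresh random IV.
  def encrypt(userId: String, value: String): Encrypted = {
    val iv = new Array[Byte](12)
    random.nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, keyFor(userId), new GCMParameterSpec(128, iv))
    Encrypted(iv, cipher.doFinal(value.getBytes(UTF_8)))
  }

  // Decryption needs the user id for the key lookup; None once the key is gone.
  def decrypt(userId: String, enc: Encrypted): Option[String] =
    keys.get(userId).map { key =>
      val cipher = Cipher.getInstance("AES/GCM/NoPadding")
      cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, enc.iv))
      new String(cipher.doFinal(enc.bytes), UTF_8)
    }

  // Clearing the single key puts all of the user's encrypted fields into oblivion.
  def forget(userId: String): Unit = {
    keys.remove(userId)
    ()
  }
}
```

The trade-offs listed on the slide show up directly: every read pays for a key lookup and a decrypt, and the user id must be available to find the key.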
  • 42. www.scling.com Shieldformation 42 @PrivacyShielded case class Sale( @PersonalId customerClubId: Option[String], @PersonalData storeId: Option[String], item: Option[String], timestamp: String ) case class SaleShielded( shieldId: Option[String], customerClubIdEncrypted: Option[String], storeIdEncrypted: Option[String], item: Option[String], timestamp: String ) case class SaleAnonymous( item: Option[String], timestamp: String ) object SaleAnonymize extends SparkJob { ... } ShieldForm object SaleExpose extends SparkJob { ... } object SaleShield extends SparkJob { ... } case class Shield( shieldId: String, personId: Option[String], keyStr: Option[String], encounterDate: String )
  • 43. www.scling.com Shield Shieldformation & lost key 43 SaleShield Sale Sale Shielded Shield Deletion requests Customer History Exposed egress SaleExpose Limited retention SaleAnonymize Sale Anonymous Sale Stats
  • 44. www.scling.com Access limitation models 44 Scramble at ingress (naive) Validate & filter at access (enterprise) Locked lake (engineered) Federated locking (scalable)
  • 45. www.scling.com Shield Shieldformation & locked lake 45 SaleShield Sale Sale Shielded Shield Deletion requests Customer History Exposed egress SaleExpose Limited retention SaleAnonymize Sale Anonymous Sale Stats Owned by target system
  • 47. www.scling.com Success in data factories 47 ● Work driven by use cases ○ Teams aligned along the value chain ● Minimal data innovation friction ○ Data democratised - accessible and usable ○ Quick code / test / debug / deploy feedback cycle ● High value / non-value work ratio ○ Guard rails to maintain speed without risk ■ Dev tooling, tests, quality metrics ○ Minimal operational toil ● Build on software engineering processes ○ Composability ○ DevOps, everything as code ○ Strong CI/CD process
  • 48. www.scling.com ● Work driven by use cases ○ Teams aligned along the value chain ● Minimal data innovation friction ○ Data democratised - accessible and usable ○ Quick code / test / debug / deploy feedback cycle ● High value / non-value work ratio ○ Guard rails to maintain speed without risk ■ Dev tooling, tests, quality metrics ○ Minimal operational toil ● Build on software engineering processes ○ Composability ○ DevOps, everything as code ○ Strong CI/CD process Success in data factories vs data trends 48 Data mesh Data contracts No code / low code SQL / Data warehouses
  • 49. www.scling.com Data factory track record 49
      Columns: (client sector) | Time to first flow | Staff size | 1st flow effort, weeks | 1st flow cost (w * 50K ?) | Time to innovation | Flows 1y after first
      Media | 1+ years | 10-30 | 1500? | 100M (0.5-1B) | 1+ year | ?
      Finance | 2 years | 10-50 | 2000? | 100M? | Years | 10?
      Media | 3 weeks | 4.5 - 8 | 15 | 750K | 3 months | 30
      Retail | 7 weeks | 1-3 | 7 | 500K * | 6 months | 70
      Telecom | 12 weeks | 2-5 | 30 | 1500K | 6 months | 50
      Consumer products | 20+ weeks | 1.5 | 30+ | 1200+K | 6+ months | 20
      Construction | 8 weeks | 0.5 | 4 | 150K * | 7 months | 10
      Manufacturing | 8 weeks | 0.5 | 4 | 200K * | 6 months | ?
  • 50. www.scling.com ● Is Shieldformation open source? ○ No. It might be when it is older and less volatile, and we have grown enough to maintain it properly. ● Is cryptoshredding really acceptable as deletion? ○ Yes. ● Is lost key pattern legally sufficient for all use cases? ○ No. ○ It does not provide complete anonymisation, but pseudonymisation with a limited time span (1 month). ○ Unless data is very sensitive, it has been deemed legally sufficient. ○ Be careful with health & geo data. Q & A? 50