This presentation was given at Data Modeling Zone 2016 in Berlin.
It summarizes the why, how, and what of PS-3C, a newly proposed approach to ensemble-based data modeling. This approach could be used for data modeling in a data-centric data architecture that is mainly based on NoSQL (aggregate-oriented) databases, Hive, and/or SQL-on-Hadoop / SQL-on-anything solutions.
This document proposes solutions for smart solid waste management in cities. It suggests replacing existing garbage bins with new Wi-Fi enabled sensor bins that can detect fill levels and send signals to collection vehicles. This would optimize collection routes and resources. Data from the smart bins could provide metrics like waste collected over time and bin maintenance needs. The document also discusses applying the 3R methodology of reduce, reuse and recycle to minimize waste. Other proposed solutions include converting waste to energy through waste-to-power plants, using plastic waste to construct roads, and generating biogas from organic waste.
Two hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick non-technical overview of concepts and methodologies in data science. Topics include a wide overview of both pattern mining and machine learning.
See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile (click the face)
Presentation at Data ScienceTech Institute campuses, Paris and Nice, May 2016, including: Intro; Data Science History and Terms; 10 Real-World Data Science Lessons; Data Science Now: Polls & Trends; Data Science Roles; Data Science Job Trends; and Data Science Future
Internet of Things & Its Application in Smart Agriculture - Mohammad Zakriya
Agriculture plays a vital role in the development of an agrarian country. In India, about 70% of the population depends on farming, and one third of the nation's capital comes from farming. Issues concerning agriculture have always hindered the country's development. A promising solution to this problem is smart agriculture: modernizing current traditional farming methods. Hence the project aims at making agriculture smart using automation and IoT technologies.
Data Science Tutorial | Introduction To Data Science | Data Science Training ... - Edureka!
This Edureka Data Science tutorial will help you understand the ins and outs of Data Science with examples. This tutorial is ideal for both beginners and professionals who want to learn or brush up on their Data Science concepts. Below are the topics covered in this tutorial:
1. Why Data Science?
2. What is Data Science?
3. Who is a Data Scientist?
4. How a Problem is Solved in Data Science?
5. Data Science Components
- Sixth Sense technology allows users to interact with digital information using hand gestures, without conventional input devices. It was first developed in the 1990s as a wearable computer-and-camera system.
- The key components are a camera to track hand gestures, a projector to display information onto surfaces, and a mobile device to handle internet connectivity. The camera sends gesture data to the mobile device for processing using computer vision techniques.
- Applications include using hand gestures to draw on surfaces, getting flight information by making circular gestures, and making calls by typing on a projected keypad. The technology aims to seamlessly connect the physical and digital worlds.
Computer vision is a field that develops techniques to electronically perceive and understand images. It involves acquiring, processing, analyzing, and understanding visual data, which can take forms such as still images or video sequences. Computer vision aims to replicate human vision abilities in artificial systems. It has applications in areas like manufacturing inspection, medical imaging, robotics, traffic monitoring, and more. Typical stages include image acquisition, preprocessing, feature extraction, detection, recognition, and interpretation.
How to Become a Data Scientist
SF Data Science Meetup, June 30, 2014
Video of this talk is available here: https://www.youtube.com/watch?v=c52IOlnPw08
More information at: http://www.zipfianacademy.com
Zipfian Academy @ Crowdflower
Data Science Training | Data Science Tutorial | Data Science Certification | ... - Edureka!
This Edureka Data Science Training will help you understand what Data Science is, and you will learn about different Data Science components and concepts. This tutorial is ideal for both beginners and professionals who want to learn or brush up on their Data Science concepts. Below are the topics covered in this tutorial:
1. What is Data Science?
2. Job Roles in Data Science
3. Components of Data Science
4. Concepts of Statistics
5. Power of Data Visualization
6. Introduction to Machine Learning using R
7. Supervised & Unsupervised Learning
8. Classification, Clustering & Recommenders
9. Text Mining & Time Series
10. Deep Learning
To take a structured training on Data Science, you can check complete details of our Data Science Certification Training course here: https://goo.gl/OCfxP2
This document provides an overview of artificial intelligence and machine learning. It begins by defining AI as computer systems that can perform tasks autonomously and adaptively. Machine learning is described as getting computers to learn without being explicitly programmed. Examples of machine learning in daily life are discussed. The basics of supervised and unsupervised learning are explained. Ethical issues around AI like bias, fairness, and determining appropriate use are then discussed. Options for addressing these issues like ensuring diversity of data and viewpoints are presented. The document concludes by providing recommendations for further learning.
The document outlines a data science roadmap that covers fundamental concepts, statistics, programming, machine learning, text mining, data visualization, big data, data ingestion, data munging, and tools. It provides the percentage of time that should be spent on each topic, and lists specific techniques in each area, such as linear regression, decision trees, and MapReduce in big data.
Artificial Intelligence & Business Application.pptx - ShamraoGhodake2
The document discusses various topics related to artificial intelligence and knowledge management including:
1) An overview of artificial intelligence and its applications in management such as supply chain optimization.
2) Key concepts in knowledge management such as knowledge creation, capture, sharing and application.
3) Frameworks for knowledge management such as Nonaka's SECI model and the Knowledge Management Maturity Model.
4) The knowledge management cycle and different approaches to knowledge management.
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial... - Edureka!
This document outlines an agenda for a data science training presentation. The agenda includes sections on why data science matters, what data science is, who a data scientist is and what they do, how problems are solved in data science, data science tools, and a demo. Key points are that data science uses tools, algorithms, and machine learning to discover patterns in raw data and gain insights. It involves tasks like processing, cleaning, mining, and modeling data, as well as communicating results. The problem-solving process involves discovery, preparation, planning, building, operationalizing, and communicating models.
IoT Based Garbage Monitoring System ppt - Ranjan Gupta
1) A group of students presented an IoT Garbage Monitoring System to help keep cities clean.
2) The system uses ultrasonic sensors and a microcontroller to monitor garbage levels in bins and displays the status on an LCD screen and web page.
3) When fully implemented, the system will help support initiatives like Swachh Bharat Mission by enabling real-time garbage monitoring and efficient collection.
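As a hedged illustration (not taken from the presentation itself), the fill-level logic such an ultrasonic-sensor bin typically implements can be sketched in a few lines. The bin depth and alert threshold below are assumed values, not figures from the slides:

```python
# Sketch of the fill-level logic an ultrasonic-sensor bin might use.
# The sensor measures the distance from the bin lid down to the garbage
# surface; the fill level is inferred from the known bin depth.
# BIN_DEPTH_CM and ALERT_THRESHOLD are illustrative assumptions.

BIN_DEPTH_CM = 100.0    # assumed depth of an empty bin, in centimeters
ALERT_THRESHOLD = 0.8   # notify collection when 80% full (assumed)

def fill_fraction(distance_cm: float) -> float:
    """Convert an ultrasonic distance reading into a 0..1 fill fraction."""
    distance_cm = min(max(distance_cm, 0.0), BIN_DEPTH_CM)  # clamp sensor noise
    return (BIN_DEPTH_CM - distance_cm) / BIN_DEPTH_CM

def needs_collection(distance_cm: float) -> bool:
    """True when the inferred fill level reaches the alert threshold."""
    return fill_fraction(distance_cm) >= ALERT_THRESHOLD

print(fill_fraction(25.0))     # 0.75 -> bin is 75% full
print(needs_collection(15.0))  # True: 85% full exceeds the threshold
```

In a deployed system the microcontroller would run this check periodically and push the result to the LCD and web page the summary mentions; the thresholding itself is this simple.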
This Isn't 'Big Data.' It's Just Bad Data. - Peter Orszag
With response rates that have declined to under 10 percent, public opinion polls are increasingly unreliable. Perhaps even more concerning, though, is that the same phenomenon is hindering surveys used for official government statistics, including the Current Population Survey, the Survey of Income and Program Participation and the American Community Survey.
A changing market landscape and open source innovations are having a dramatic impact on the consumability and ease of use of data science tools. Join this session to learn about the impact these trends and changes will have on the future of data science. If you are a data scientist, or if your organization relies on cutting edge analytics, you won't want to miss this!
The document discusses mind reading computers. It begins with an introduction explaining that mind reading computers analyze facial expressions and gestures in real time to infer mental states. It then discusses the technology used, including a futuristic headband that measures blood oxygen levels around the brain. Finally, it discusses potential applications of mind reading computers, such as helping communicate with coma patients or allowing people to control devices with their thoughts.
Smart Eye's objective is to be the leading provider of eye tracking systems for vehicles and research. It aims to understand, assist with, and predict human intentions through eye tracking technology. Eye tracking is useful for research into visual attention and consumer purchasing behavior since most information processing and purchases are visually driven. Smart Eye was founded in 1999 and has since released several eye tracking products, becoming a leader in the automotive industry. Advantages include speed, ease of use, and ability to determine areas of interest, while disadvantages include cost and difficulty tracking some users.
The document provides an overview of artificial intelligence (AI), including its history, definition, examples, advantages, and disadvantages. It traces the origins of AI concepts back to ancient Greece and discusses early milestones like the Turing test. Examples of modern AI applications mentioned include Google Maps, facial recognition, chatbots, and automated payments. While AI can reduce human error and perform dangerous tasks, disadvantages include high costs and an inability to think creatively.
The Blue Eye technology aims to give computers human-like perceptual and sensory abilities. It uses sensors to identify a user's actions, extract key information, analyze it, and determine their physical, emotional, and informational state. The Blue Eye system uses a personal area network connecting data acquisition units containing sensors to a central system unit for processing. Sensors like the Jazz Multisensor can track eye movement, blood oxygenation, acceleration, and light intensity. The central system analyzes incoming sensor data and records conclusions. Potential applications include power plant control rooms, ship bridges, and professional drivers, helping avoid human errors from tiredness.
Data Science is a wonderful technology that has applications in almost every field. Let's learn the basics of this domain on 16th March at (time).
Agenda
1. What is Data Science? How is it different from ML, DL, and AI
2. Why is this skill in demand?
3. What are some popular applications of Data Science
4. Popular tools and frameworks used in Data Science
Intelligent Document Processing in Healthcare. Choosing the Right Solutions. - Provectus
Healthcare organizations generate piles of documents and forms in different formats, making it difficult to achieve operational excellence and streamline business processes. Manual entry and OCR are no longer viable, and healthcare entities are looking for new solutions to handle documents.
In this presentation you can learn about:
- Healthcare document types and use cases
- IDP framework: building blocks for document processing solutions
- The document processing market landscape
- Methodology for solution evaluation: comparing apples to apples
Whether you are looking for a ready-made solution or plan to build a custom solution of your own, this webinar will help you find the best fit for your healthcare use cases.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
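The steps that overview lists (gathering, preparation, model building, validation) can be made concrete with a toy sketch. This uses only the standard library; the dataset and the nearest-centroid "model" are illustrative stand-ins for what libraries like Pandas and Scikit-Learn would do at scale:

```python
# Toy walk-through of common data science steps:
# gather -> prepare -> build -> validate, using only the standard library.
from statistics import mean

# 1. Gather: a tiny labeled dataset of (feature, label) pairs (assumed values)
data = [(1.0, "a"), (1.2, "a"), (0.9, "a"), (3.0, "b"), (3.3, "b"), (2.9, "b")]

# 2. Prepare: split into training and held-out validation sets
train = data[:2] + data[3:5]
valid = [data[2], data[5]]

# 3. Build: "fit" a nearest-centroid classifier (one mean per class)
centroids = {
    label: mean(x for x, y in train if y == label)
    for label in {y for _, y in train}
}

def predict(x: float) -> str:
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# 4. Validate: accuracy on the held-out examples
accuracy = mean(1.0 if predict(x) == y else 0.0 for x, y in valid)
print(accuracy)  # 1.0 on this toy data
```

Real projects add the exploration and deployment steps the talk covers, but the gather/split/fit/score skeleton stays the same.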
This is a project report on a Smart Dustbin Using IoT, prepared by Lakshya Pandey, a second-year Electrical Engineering student at Bipin Tripathi Kumaon Institute of Technology (BTKIT), Dwarahat.
All Rights Reserved.
Who is a Data Scientist? | How to become a Data Scientist? | Data Science Cou... - Edureka!
** Data Scientist Master's Program: https://www.edureka.co/masters-program/data-scientist-certification **
This Edureka PPT on "Who is a Data Scientist" will help you understand what a data scientist does, their roles and responsibilities, and what the data science profile is all about. You will also get a glimpse of what kind of salary packages and career opportunities the data science domain offers.
Below topics are covered in this PPT:
Who is a Data Scientist?
What is Data Science?
Who can take up Data Science?
How to become a Data Scientist?
Data Scientist Skills
Data Scientist Roles & Responsibilities
Data Scientist Salary
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
This presentation describes a multi-faceted project used in an Intermediate Accounting class where students interact with accounting professionals. How this project contributes to student success as well as how such a project aligns with ACBSP standards are discussed.
Dear Sir/Madam,
I am writing to express my strong interest in the Civil Draftsman position recently posted on your website. I completed my Civil Draftsman training at Government Technical College, Sahiwal, Pakistan, and have three years of experience as a draftsman, including site supervision.
I worked at Continental Overseas Construction Services, Lahore, Pakistan as a Site Supervisor (infrastructure and execution work), and I am currently working as an Architectural Draftsman at Aamer Fayyaz & Associates, Lahore, Pakistan. As you will see from my resume, I have the requisite skills to be successful in this position. More importantly, I have the passion, drive, and relevant experience in the fields of architecture and civil engineering, gained at well-renowned companies such as Continental Overseas Construction Services and Aamer Fayyaz & Associates. Combined with this passion and my knowledge, I am very comfortable with site execution, drawing preparation, and report writing. It is worth mentioning that I worked with Aamer Fayyaz & Associates on several projects in Lahore before joining them, so the working environment here is familiar to me and will take little time to settle into. I can be reached by phone or email, both listed on my resume. Thank you for your time and consideration; I look forward to speaking with you soon.
Sincerely,
Amer Hasan
+92 300 5510084
Over four years, Walsh College targeted improvements to student performance in written communication skills through its MBA capstone course. Assessment of student work identified gaps in sources/evidence and formatting between online and on-ground students. The college addressed these issues by revising assignments to better align with competencies, providing targeted instructional content, and implementing formative writing activities across modalities. These actions closed performance gaps and increased student scores by over 50% on sources/evidence and 13% on formatting from 2011-2014. Walsh College plans to continue refining rubrics, assignments, and online resources to further enhance written communication skills through continuous assessment and improvement cycles.
This document is a resume for Michelle K. Clayton. It summarizes her 20 years of experience in sales and sales training for academic publishing companies, including her roles leading high performing sales teams and consistently exceeding sales goals. She has extensive experience selling and implementing digital learning solutions and is skilled in consultative selling, relationship building, and coaching others.
This document provides guidance on conducting a successful self-study for accreditation. It recommends that the core team be creative, have strong writing and attention to detail skills, and work well under pressure and in teams. Support from campus presidents, faculty, deans, and administration is important. The self-study process involves drafting a narrative and collecting documentation in various areas like faculty information, courses, and assessments. Clear labeling, consistent application of standards, and visual representations should be used. Thorough documentation is key. Timelines should allow more time than anticipated. The completed self-study should answer all parts of the standards clearly and have support for its conclusions. Being prepared for the on-site visit is also advised.
The document discusses how the Baker School of Business and Technology at the Fashion Institute of Technology transformed its institution to fully accept and embrace accreditation from the Accreditation Council for Business Schools and Programs (ACBSP). It outlines the school's history and enrollment growth. In 2012, a strategic review recommended pursuing ACBSP accreditation to streamline curriculum and expand course offerings. Some faculty were initially skeptical, but two core approaches, "pull" involving deep faculty involvement and "push" through aggressive process upgrades and communication, helped address concerns and gain support. With faculty-led leadership and project management, as well as multi-level communication through events and updates, the school was able to successfully transform its culture and earn accreditation.
Capella University uses a collaborative model to support learner success through accreditation. It has a centralized accreditation department that oversees the university's regional and specialized accreditations. Capella's competency-based and professionally aligned programs are assessed using a system aligned with academic and professional standards, with learners tracking their progress through course competency maps. The accreditation cycle involves initiation, discovery, verification, decision making, and maintenance steps to maintain accreditations like ACBSP for business programs.
The document discusses Walsh College's process for simultaneously assessing and grading student work to collect internal assessment data. Walsh College maps assignment instructions and rubric criteria to its core competencies of problem solving and written communication. Faculty integrate functional rubrics into graded assignments in target courses to assess competencies. This allows Walsh to leverage assignments for both grading students and assessing competency achievement. The approach has been implemented in 17 courses so far. Challenges include ensuring faculty assess competencies objectively when grading. Suggestions for overcoming challenges include increasing faculty involvement and periodically assessing results to improve processes.
Top 8 executive administrator resume samples - kerrojom
The document provides resources for executive administrator resumes, cover letters, and interview preparation materials from the website resume123.org. It lists 8 resume samples, tips for writing effective resumes and cover letters, and over 60 interview questions, tips, and other materials to help prepare for an executive administrator interview.
The document describes PSG Institute of Management's efforts to foster entrepreneurship among its students in India. It discusses how the school integrates experiential learning approaches into its curriculum through business plan competitions and mentorship opportunities. Several student-run businesses and social ventures have launched as a result of this focus on hands-on learning. The school aims to develop entrepreneurial skills and kindle the entrepreneurial spirit of its graduates.
This document summarizes an experiential learning class at New England College that partners with an organization called TIST in Kenya. The class includes a 10-day trip to Kenya to work on sustainable development projects. Key components of the class include analyzing the triple bottom line approach, impact analysis methods, adaptive leadership, and developing a sense of global citizenship. Outcomes include conducting a social return on investment analysis of TIST and developing marketing materials to support their carbon offset program. Student feedback praised the unique learning experience and change in perspective gained from the class.
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas... - Charley Hanania
Compression: a hidden Gem for IO heavy Databases
The limiting factor in most database systems is the ability to read and write data to the IO subsystem.
We're still using storage layouts and methodologies in SQL Server that are a reflection of old spinning media in times gone by.
Until major changes are made to the internal storage layouts, we have "some" hope with options such as data compression, sparse columns and filtered indexes, which not only save space on disk, but also reflect a saving in memory.
In this session we will go over the IO savings technologies presented in SQL Server, and discuss how implementing some of these will assist in your operational performance goals.
Presenter: Charley Hanania, MVP
Charley is Principal Consultant at QS2 AG in Switzerland and has consulted to organisations of all sizes during his extensive career in Database and Platform Consulting.
He's been focussed on SQL Server since v4.2 on OS/2 and with over 15 years of experience in IT he's supported companies in the areas of DB training, development, architecture & administration throughout Europe, America & Australasia.
Communities are Charley's passion and he became active in database communities in the mid 90's, participating in heterogeneous database user groups in Australia. He continues to lead an active role through community events such as Database Days, the European PASS Conference, PASS & the Swiss PASS Chapter.
This presentation will be useful to those who would like to get acquainted with Apache Spark architecture and top features, and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Was presented on Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
The document discusses SQL vs NoSQL databases. It provides background on the proliferation of NoSQL databases and their advantages over relational databases for handling unstructured data, high scalability, and easy distribution. However, it argues that SQL remains well-suited for analytical queries due to its portability, wide use, and the fact that many reporting tools are built for it. The document also presents a case study of how the online gaming company King uses a hybrid of SQL and NoSQL technologies to handle their massive scale of user data and high-volume analytics needs.
Eugene Polonichko, "Architecture of modern data warehouse" - Lviv Startup Club
The document discusses the architecture of a modern data warehouse using Microsoft technologies. It describes traditional data warehousing approaches and outlines ten characteristics of a modern data warehouse. It then details Microsoft's approach using Azure Data Factory to ingest diverse data types into Azure Blob Storage, Azure Databricks for analytics and data transformation, and Azure SQL Data Warehouse for combined structured data. It also discusses technologies for storage, visualization, and links for further information.
The document summarizes information about a conference on relational database options in Azure. It provides an agenda that includes introductions to Azure SQL Database, SQL Data Warehouse, Azure CosmosDB, and other database platforms on Azure. It also discusses infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS) models and options for migrating databases to Azure. The presentation aims to help attendees learn about relational database options in Microsoft's cloud computing platform.
Why you should be mining your data and how to actually do it. Every company needs a rock star. We want it to be you. This session will give real world examples of data mining successes as well as walk you through how to get started down the path of data enlightenment, so that you too can say "I Am A Data Miner℠".
This document provides an overview of Azure SQL Data Warehouse (SQL DWH), a cloud data warehouse service. It discusses SQL DWH's massively parallel processing (MPP) architecture that allows independent scaling of compute and storage. The document demonstrates how to create a SQL DWH, load data using PolyBase, and use common tools. It is intended to help users understand what SQL DWH is, how it works, and common scenarios it can be used for, such as processing large volumes of data without needing to purchase and manage hardware.
Pablo Pazos Gutiérrez gave a talk on developing openEHR systems. He discussed storing openEHR data using different database types, openEHR system architectures that have evolved to be more distributed and service-oriented, generating user interfaces from archetypes and templates, performing archetype-based validation on entered data, querying and visualizing openEHR data, and implementing openEHR over the past 8 years in Latin America.
This document provides an overview of NoSQL and MongoDB. It discusses trends driving the adoption of NoSQL databases like increasing data sizes, more connectedness, and individualization. It covers the different types of NoSQL databases and MongoDB in particular. Key concepts discussed include the CAP theorem, MongoDB's document-oriented data model, and basic CRUD operations in MongoDB using the shell.
Exploring OrientDB as a graph-model NoSQL database.
The main goal of this project is to provide theoretical and technical details of, and debate on, some powerful features of OrientDB. We provide some comparison attempts between OrientDB 2.1.8 and SQL Server 2012, mostly focused on the MovieLens dataset and on building a recommendation engine.
Secrets of Enterprise Data Mining: SQL Saturday 328 Birmingham AL - Mark Tabladillo
This document discusses secrets of enterprise data mining. It begins by defining data mining as the automated or semi-automated process of discovering patterns in data. It then discusses how data mining can be applied in various industries like telecommunications, oil and gas, and Volkswagen Group. Finally, it discusses how Microsoft offers solutions for enterprise data mining through SQL Server Analysis Services and Microsoft Azure Machine Learning.
This document outlines several case-based scenarios for demonstrating data science activities using Azure services. Six cases are described:
1) A playground for citizen data scientists to gain an end-to-end understanding of the data science process using a simple UI.
2) Using SQL databases and services for machine learning tasks when all data resides in SQL.
3) Parallel training of models on multiple datasets to automate and scale the training process.
4) Using GPU-enabled environments for training deep learning models requiring GPU acceleration.
5) Leveraging high-speed data processing services when working with large datasets over 1GB.
6) A basic sandbox environment for data scientists, engineers, and analysts providing pre-
Splice Machine's use of Apache Spark and MLflow - Databricks
Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource - which enables seamless data movement between Spark Data Frames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS. Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine, integrated with the plethora of DataFrame-compatible libraries and MLflow capabilities, manage a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.
The document provides a summary of Mark Hargraves' work experience in business intelligence. It details several roles he held as a senior BI consultant where he developed ETL processes and data warehouses using SQL Server and the Microsoft BI stack. For each role, he extracted data from various source systems, modeled it using a schema called Spider Schema, and built cubes and reports in SSAS and SSRS. The roles showed his experience in full BI project development and remote work for clients in different industries.
The Machine Learning behind the Autonomous Database- EMEA Tour Oct 2019 Sandesh Rao
Autonomous Database is one of the hottest Oracle products, where we have attempted to use Machine Learning for several aspects of the service. We take a view on the current state of ML in the Autonomous Database Cloud and how we process this data in ADW/ATP with Zeppelin notebooks to find anomalies, troubleshoot them at a scale of several petabytes a year, and conduct AIOps. We will walk through sample notebooks for several use cases: a log anomaly timeline, where we reduce significant amounts of logs with semi-supervised machine learning techniques and match them in near real time; using convolution filters to determine maintenance windows within database workloads; determining the best times to do database backups; security anomaly timelines; and many others. The presentation is accompanied by several examples of how to apply these techniques; machine learning knowledge is preferred but not a prerequisite.
This document provides an overview of NoSQL databases in Azure. It discusses 7 different database types - key-value, column family, document, graph and Hadoop. For each database type it provides information on what it is, examples of use cases, and how to query or model data. It encourages attendees to explore these databases and stresses that choosing the right database for the job is important.
Strata 2014: Design Challenges for Real Predictive Platforms Max Gasner
The first databases were tightly coupled to their implementation details and use cases, until the relational revolution opened up the field and made database systems flexible enough to support a wide variety of applications with minimal configuration. What will it take to make predictive systems as ubiquitous and easy to use as databases? We’ll discuss the crucial design criteria for future predictive platforms and the kinds of interfaces they need to be able to support, as well as the challenges that lie between the state of the art and the future we envision.
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop... - DataKitchen
The main objective of this workshop is to give the audience hands on experience with several Hadoop technologies and jump start their hadoop journey. In this workshop, you will load data and submit queries using Hadoop! Before jumping in to the technology, the Founders of DataKitchen review Hadoop and some of its technologies (MapReduce, Hive, Pig, Impala and Spark), look at performance, and present a rubric for choosing which technology to use when.
NOTE: To complete the hands-on portion in the time allotted, attendees should come with a newly created AWS (Amazon Web Services) account and complete the other prerequisites found in the DataKitchen blog.
This presentation is about health care analysis using sentiment analysis.
It is especially useful to students working on sentiment analysis projects.
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr... - Marlon Dumas
This webinar discusses the limitations of traditional approaches to business process simulation based on hand-crafted models with restrictive assumptions. It shows how process mining techniques can be assembled together to discover high-fidelity digital twins of end-to-end processes from event data.
Did you know that drowning is a leading cause of unintentional death among young children? According to recent data, children aged 1-4 years are at the highest risk. Let's raise awareness and take steps to prevent these tragic incidents. Supervision, barriers around pools, and learning CPR can make a difference. Stay safe this summer!
Do People Really Know Their Fertility Intentions? Correspondence between Sel... - Xiao Xu
Fertility intention data from surveys often serve as a crucial component in modeling fertility behaviors. Yet, the persistent gap between stated intentions and actual fertility decisions, coupled with the prevalence of uncertain responses, has cast doubt on the overall utility of intentions and sparked controversies about their nature. In this study, we use survey data from a representative sample of Dutch women. With the help of open-ended questions (OEQs) on fertility and Natural Language Processing (NLP) methods, we are able to conduct an in-depth analysis of fertility narratives. Specifically, we annotate the (expert) perceived fertility intentions of respondents and compare them to their self-reported intentions from the survey. Through this analysis, we aim to reveal the disparities between self-reported intentions and the narratives. Furthermore, by applying neural topic modeling methods, we could uncover which topics and characteristics are more prevalent among respondents who exhibit a significant discrepancy between their stated intentions and their probable future behavior, as reflected in their narratives.
Optimizing Feldera: Integrating Advanced UDFs and Enhanced SQL Functionality ... - mparmparousiskostas
This report explores our contributions to the Feldera Continuous Analytics Platform, aimed at enhancing its real-time data processing capabilities. Our primary advancements include the integration of advanced User-Defined Functions (UDFs) and the enhancement of SQL functionality. Specifically, we introduced Rust-based UDFs for high-performance data transformations and extended SQL to support inline table queries and aggregate functions within INSERT INTO statements. These developments significantly improve Feldera’s ability to handle complex data manipulations and transformations, making it a more versatile and powerful tool for real-time analytics. Through these enhancements, Feldera is now better equipped to support sophisticated continuous data processing needs, enabling users to execute complex analytics with greater efficiency and flexibility.
9. Not built for Big Data Lake / Data Centricity
Photo credit: Lake Public Domain, http://www.writeups.org/star-trek-brent-spiner-data/
@rwerschkull
nl.linkedin.com/in/rogierwerschkull
10. ‘Data may first be stored in a
data lake so that it can be explored, cleaned, and prepared.
If it can be structured in a relational format (basically rows and columns)
and needs to be used frequently and kept highly secure, it may go into a
data warehouse.
If it stops being used frequently, it may go back to a HDFS
(Hadoop Distributed File System)-based archive.’
Data Centric / data first ??
THOMAS H. DAVENPORT, WALL STREET JOURNAL, 3-6-2015
http://blogs.wsj.com/cio/2015/06/03/the-shift-to-a-new-data-architecture/
12. Data Flood
Photo credit: Kurayba (https://www.flickr.com/photos/48503330@N08/28564454666/ )
under cc licence (https://creativecommons.org/licenses/by-sa/2.0/)
13. The possible result…
Photo credit: https://highfiveexports.wordpress.com/2010/06/25/3000-pieces-lego-mix-specialty-pieces-rare-pieces-bricks-blocks-parts-more-ultimate-lot-of-lego-parts-pieces-lego-for-sale-lego-batman-lego-starwars-lego-technic-lego-minifigur/
14. But isn't Data Vault v2 'made for Big Data-centric systems?'
15. In DV2 you still do this in one go:
Subject Oriented + Integrated + Time Variant + Non-Volatile = EDW
16. Coding = a lot like modelling?
Being Data Centric conflicts with the complex data modelling work
http://xkcd.com/844/
24. a) Hashing of business keys

Rolling Stock Nr | Datetime            | Sensor Id       | Value | Concatenated Business Key                | Key Len | MD5 Hash                         | Hash Len
8739             | 2015-01-22 01:34:27 | 72A1_FINV       |   123 | 8739|2015-01-22 01:34:27|72A1_FINV       |      34 | 86ae4c6b0e2e2d5a13a0d11440529aeb |       32
8739             | 2015-01-22 01:34:27 | 72A1_SLDET      |   100 | 8739|2015-01-22 01:34:27|72A1_SLDET      |      35 | 51ce9bc292eef407bd7c91a52eebcf2e |       32
8739             | 2015-01-22 01:34:32 | 72A1_FINV       |   126 | 8739|2015-01-22 01:34:32|72A1_FINV       |      34 | 9482a41c1fecc4c64b8c437af6cc85e8 |       32
8739             | 2015-01-22 01:34:32 | 13A8_MW_UBAT_VT |     5 | 8739|2015-01-22 01:34:32|13A8_MW_UBAT_VT |      42 | e4160914ee55ce0b93f87b23366a0ce3 |       32
8674             | 2015-01-22 01:34:26 | 72A1_FINV       |     6 | 8674|2015-01-22 01:34:26|72A1_FINV       |      34 | fcb3e7c8c91e44ce396d908a4948ca65 |       32
8674             | 2015-01-22 01:34:26 | 16A1_HSVEROND   |     7 | 8674|2015-01-22 01:34:26|16A1_HSVEROND   |      38 | fe9098c8c291ad56af5c8afae5169196 |       32
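The hash column above is simply MD5 over the pipe-concatenated business key. A minimal sketch of that computation (the function name is mine, for illustration):

```python
import hashlib

def hash_business_key(*parts):
    """Concatenate business key parts with '|' and MD5-hash the result,
    as in the table above. Returns (key, key length, 32-char hex hash)."""
    key = "|".join(str(p) for p in parts)
    return key, len(key), hashlib.md5(key.encode("utf-8")).hexdigest()

key, key_len, md5 = hash_business_key(8739, "2015-01-22 01:34:27", "72A1_FINV")
```

Note the variable key length versus the fixed 32-character hash: it is exactly this fixed-width, uniformly distributed output that destroys the natural ordering and distribution information a sharding key or query optimizer relies on.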
25. Hashing…
Loses statistical information regarding the data distribution
(query optimizers do not like this)
Column family, Document and Key-value databases need a
good (natural) sharding key for (partial) key lookups!*
* http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/
26. b) Surrogate business keys
Surrogate keys require centralized coordination
…and thus can impact the overall system's scalability and availability.
A lot of MPP / NoSQL databases simply do not have them…
28. ‘In my opinion the answer lies in the adoption of the
persistent (Historical) Staging Area concept
(also known as Historical Staging or the History Area).
This basically adopts the fundamentals of a Data Warehouse’
‘The Historical Staging Area effectively ‘acts’ as
Data Lake,
but in a better defined form as data deltas and
event date/times are taken into account.’
36. Persistent Staging - how
Identify source / event stream Primary or Unique Key
Use source metadata for this!
Automate the building of a PS 'around this key'
Take all columns!
Historize using an SCD-2 approach
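The SCD-2 historization step can be sketched as a toy, in-memory routine. This is a minimal sketch under the assumption of full snapshot deliveries; the function and field names (scd2_merge, bk, attrs) are mine, not from the deck:

```python
END_OF_TIME = "9999-12-31 00:00:00"

def scd2_merge(history, snapshot, load_dts):
    """Historize a full source snapshot into a persistent staging table.
    history: list of dicts {bk, attrs, load_dts, load_end_dts, deleted}
    snapshot: dict {bk: attrs}, the latest full extract keyed on the
    source primary/unique key (the PS is built 'around this key')."""
    current = {r["bk"]: r for r in history if r["load_end_dts"] == END_OF_TIME}
    out = list(history)
    for bk, attrs in snapshot.items():
        cur = current.get(bk)
        if cur is not None and cur["attrs"] == attrs and not cur["deleted"]:
            continue                        # unchanged: keep the open record
        if cur is not None:
            cur["load_end_dts"] = load_dts  # close the superseded version
        out.append({"bk": bk, "attrs": attrs, "load_dts": load_dts,
                    "load_end_dts": END_OF_TIME, "deleted": False})
    for bk, cur in current.items():         # keys gone from source: soft delete
        if bk not in snapshot and not cur["deleted"]:
            cur["load_end_dts"] = load_dts
            out.append({"bk": bk, "attrs": cur["attrs"], "load_dts": load_dts,
                        "load_end_dts": END_OF_TIME, "deleted": True})
    return out
```

Deletes are registered as new records with a deleted flag, matching the '[Deleted Flag] OR delete as new record' choice on the next slide.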
38. Persistent Staging Metadata-2
Column level
[Load Date Timestamp]
[Load End Date Timestamp]
[Deleted Flag] OR register a delete as a new record
[Source system] on table / file level (lowest possible)
Load End Date Timestamp: possible but difficult…
Requires updates!
39. Updates in Hive? (Isn't HDFS append-only?)
ACID is possible in Hive!
ACID makes updates possible
by registering updates as 'new data'
Reconciliation / compacting when idle / at user command
Use ORC files!!!
PLUS changing the Hive configuration…
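The 'register updates as new data, compact later' mechanism can be illustrated with a toy merge. This is a sketch of the idea only, not Hive's actual ORC delta-file implementation:

```python
def compact(base, deltas):
    """Fold appended delta records into the base data: keep only the
    latest version per key (highest sequence number) and drop keys
    whose latest version is a delete tombstone (value None)."""
    latest = {}
    for key, seq, value in base + deltas:
        if key not in latest or seq > latest[key][0]:
            latest[key] = (seq, value)
    # rewriting the base this way is the 'reconciliation / compacting' step
    return sorted((k, s, v) for k, (s, v) in latest.items() if v is not None)

base = [("a", 1, "x"), ("b", 1, "y")]
deltas = [("a", 2, "x2"), ("b", 2, None)]  # update 'a', delete 'b'
compacted = compact(base, deltas)
```

Between compactions, readers must merge base and deltas on the fly, which is why compaction runs when the system is idle or on command.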
40. What about semi-structured data?
Hive:
Put semi-structured data (= variable columns) in a MAP data type
OR use a data storage format that supports schema evolution:
AVRO (ORC support in development)
Or HBase…
It only has one data type (byte); the schema is 'applied' on read
and can be different for every row
43. Like Data Vault (2) BUT…
Always starts with conceptual data modelling
NOT the primary location of data & history
Virtualised (only if performance allows). Should be deterministic!
No Link Satellites
No surrogate or hash keys, only 'Concatenated Natural Business Keys'
Explicit helper entities
45. Business Concept (BC)
A UNIQUE, domain-specific point of integration
…a business entity
…within its own domain
…does not necessarily need to be enterprise-wide!
46. Why not 'enterprise wide'?
Company Customer
Sales Customer
International Sales Customer
Local Sales Customer
Marketing Customer
Customer …
47. Business Concept Metadata
Entity level
[Description]
[Owner / Responsible]
Column level
[Load Date Timestamp]
[Source system] on table / file level (lowest possible)
49. BC: important notes
Easiest entity to virtualise
(if performance allows)
No hashing & no surrogate business keys!
(not by default at least!)
51. Concept Context (CC)
Contains context about a Concept,
in a historical way
…like a Data Vault Satellite
Every CC belongs to only one BC
Separate entity per source system / table / stream
52. Concept Context Metadata
Entity level
[Description]
[Owner / Responsible]
Column level
[Load Date Timestamp]
[Source system] on table / file level (lowest possible)
Not mandatory for streaming data:
• [Load End Date Timestamp]
• [Deleted Flag] OR register a delete as a new record
53. Example
Business Concepts: NSR-Station, NS-Travelcard, NS-Trainseries, NS-Traveller
Concept Contexts per Business Concept:
NSR-Station: [valuation] (source: NSS2 table p), [description] (source: NSS1 table x), [adres] (source: NSS1 table y)
NS-Trainseries: [description] (source: NTR table q)
NS-Travelcard: [ovchip_personal] (source: NSR table r), [ovchip_on-usage] (source: NSR table s)
NS-Traveller: [personal_details] (source: NSR table t), [adres_data] (source: NSR table t)
55. CC: important notes
More difficult to virtualise
(depends on the semantic gap with the source!)
But do make it virtual when 'streaming data' is necessary!
Because we have the PS layer,
exposing all columns is not necessary!
Refactoring is easier…
57. Connector (C)
Relations between Concepts + context,
in a historical way
…a merger of a Data Vault Link + Link Satellite
Must ALWAYS have a driving key defined
= a (sub)set of keys that makes a Connector unique at one point in time
58. Connector Driving Key
Explicitly defining a driving key as metadata…
Gives business understanding!
Makes it possible for the Connector to correctly handle delta data deliveries…
• so that a change (on the driving key)
• is not registered as a new 'connection'
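A sketch of the driving-key rule in the deck's terms, assuming the Connector is held as a list of historized key dicts; the function and field names are illustrative, not from the presentation:

```python
END_OF_TIME = "9999-12-31 00:00:00"

def apply_connector_delta(connector, record, driving_cols, load_dts):
    """Apply one delta record to a Connector (list of historized rows).
    driving_cols is the (sub)set of keys that makes a connection unique
    at one point in time: a delta with the same driving key but other
    members closes the previous connection instead of adding an extra one."""
    dk = tuple(record[c] for c in driving_cols)
    for row in connector:
        if (row["load_end_dts"] == END_OF_TIME
                and tuple(row["keys"][c] for c in driving_cols) == dk
                and row["keys"] != record):
            row["load_end_dts"] = load_dts   # close superseded connection
    if not any(r["keys"] == record and r["load_end_dts"] == END_OF_TIME
               for r in connector):
        connector.append({"keys": dict(record),
                          "load_dts": load_dts,
                          "load_end_dts": END_OF_TIME})

# in the deck's travel example the driving key is travelcard + check-in time
conn = []
apply_connector_delta(conn, {"card": "TC1", "checkin": "t1", "station": "Ut"},
                      ["card", "checkin"], "d1")
apply_connector_delta(conn, {"card": "TC1", "checkin": "t1", "station": "Asd"},
                      ["card", "checkin"], "d2")
```

The second delta shares the driving key with the first, so the old connection is end-dated and only the corrected one stays open.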
59. Connector Metadata
Entity level
[Description]
[Owner / Responsible]
Column level
[Load Date Timestamp]
[Source system] on table / file level (lowest possible)
Not mandatory for streaming data:
• [Load End Date Timestamp]
• [Deleted Flag] OR register a delete as a new record
60. Example
Business Concepts: NSR-Station, NS-Travelcard, NS-Trainseries, NS-Traveller
Concept Contexts per Business Concept:
NSR-Station: [valuation] (source: NSS2 table p), [description] (source: NSS1 table x), [adres] (source: NSS1 table y)
NS-Trainseries: [description] (source: NTR table q)
NS-Travelcard: [ovchip_personal] (source: NSR table r), [ovchip_on-usage] (source: NSR table s)
NS-Traveller: [personal_details] (source: NSR table t), [adres_data] (source: NSR table t)
Connector NSR-Travelmovement: Checkin timestamp, from, to
Driving key: NS-Travelcard + Checkin timestamp
65. Business Alias
To help switching from sources that are tied together by technical (surrogate) keys…
…to a Business Key based model.
It's a LOOKUP table that translates the technical key to the Business Key
66. Example
NSR-Station: [valuation] (source: NSS2 table p), [description] (source: NSS1 table x), [adres] (source: NSS1 table y)
BA-NSS1: key lookup for NSS1 source tables
BA-NSS2: key lookup for NSS2 source tables
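A Business Alias can be sketched as a plain lookup dict per source system; the NSS1 rows and column names below are made up for illustration:

```python
def build_business_alias(source_rows, tech_col, bk_cols):
    """Business Alias: a small per-source lookup table translating a
    technical (surrogate) key to the concatenated natural business key."""
    return {row[tech_col]: "|".join(str(row[c]) for c in bk_cols)
            for row in source_rows}

# hypothetical NSS1 station rows keyed by a surrogate station_id
nss1_stations = [{"station_id": 17, "station_code": "Ut"},
                 {"station_id": 42, "station_code": "Asd"}]
ba_nss1 = build_business_alias(nss1_stations, "station_id", ["station_code"])
```

The lookup is done while loading the Business Concept, not inside Concept Context entities, so the table stays small enough to keep in memory.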
68. BA: important details
Has a 1-to-(0,1) relation with a Business Concept
More difficult to virtualise
The lookup table should be kept small!
Therefore: DO NOT do the key lookup in the Concept Context entity!
Load / generate together with the BC
Preferably 'in memory' somehow…
70. BC-Timeline
Integrates the validity timelines of the Concept Contexts
belonging to a Business Concept
…like a Data Vault Point-in-time construct
But mandatory!
And with a clearly defined and performant approach!
71. Example: NSR-Station with Concept Contexts [valuation] and [adres]

Concept Context [valuation] (source: NSS2 table p):
BK_NSR-Station | WOZ waarde | Waarde Ratingbureau X | META_Laad_dts      | META_Laad_eind_dts
Ut             | 20 milj    | 18 milj               | 1-1-2014 22:00:00  | 1-1-2015 21:59:59
Ut             | 22 milj    | 18 milj               | 1-1-2015 22:00:00  | 1-3-2016 21:59:59
Ut             | 22 milj    | 23 milj               | 1-3-2016 22:00:00  | 31-12-9999 00:00:00

BC-Timeline (BCT) with the combined validity intervals:
BK_NSR-Station | Combined_Load_dts  | Combined_Load_end_dts
Ut             | 5-6-2013 22:00:00  | 1-1-2014 21:59:59
Ut             | 1-1-2014 22:00:00  | 1-1-2015 21:59:59
Ut             | 1-1-2015 22:00:00  | 4-7-2015 21:59:59
Ut             | 4-7-2015 22:00:00  | 1-3-2016 21:59:59
Ut             | 1-3-2016 22:00:00  | 31-12-9999 00:00:00
Asd            | 5-6-2013 22:00:00  | 31-12-9999 00:00:00

Concept Context [adres] (source: NSS1 table y):
BK_NSR-Station | Postadres_postcode | GPS               | … | META_source | META_Load_dts      | META_Load_end_dts
Ut             | 3500GJ             | 52.08954, 5.11064 | … | NSS1_y      | 5-6-2013 22:00:00  | 4-7-2015 21:59:59
Ut             | 3511 CE            | 52.37269, 4.89299 | … | NSS1_y      | 4-7-2015 22:00:00  | 31-12-9999 00:00:00
Asd            | 1012 AB            | 52.37269, 4.89299 | … | NSS1_y      | 5-6-2013 22:00:00  | 31-12-9999 00:00:00
Asa            | …                  | …                 | … | …           | …                  | …
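The combined BCT intervals are derived by merging every load timestamp of every Concept Context into one timeline per business key. A simplified sketch (half-open intervals that end at the next start, rather than at 'next start minus one second' as on the slide; the data is an ISO-style rewrite of the NSR-Station example):

```python
END_OF_TIME = "9999-12-31 00:00:00"

def bc_timeline(context_tables):
    """Integrate the validity timelines of all Concept Contexts of one
    Business Concept: per business key, every load timestamp seen in any
    context starts a new combined interval (the mandatory BCT)."""
    starts = {}
    for table in context_tables:
        for bk, load_dts, _end in table:
            starts.setdefault(bk, set()).add(load_dts)
    timeline = []
    for bk, dts in starts.items():
        ordered = sorted(dts)
        for begin, end in zip(ordered, ordered[1:] + [END_OF_TIME]):
            timeline.append((bk, begin, end))
    return sorted(timeline)

valuation = [("Ut", "2014-01-01 22:00:00", "2015-01-01 21:59:59"),
             ("Ut", "2015-01-01 22:00:00", "2016-03-01 21:59:59"),
             ("Ut", "2016-03-01 22:00:00", END_OF_TIME)]
adres = [("Ut", "2013-06-05 22:00:00", "2015-07-04 21:59:59"),
         ("Ut", "2015-07-04 22:00:00", END_OF_TIME),
         ("Asd", "2013-06-05 22:00:00", END_OF_TIME)]
bct = bc_timeline([valuation, adres])  # five intervals for Ut, one for Asd
```

Because the rule ('each new load timestamp starts a new interval') is deterministic, the BCT can be virtualised or regenerated at will.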
73. What makes PS-3C a different Ensemble?
Business Concept X, with Concept Contexts X-A and X-B, and BC-Timeline X
Business Concept Y, with Concept Contexts Y-A, Y-B and Y-C
Connector
Business Alias A and Business Alias B
74. 1) Explicitly splitting the work
Data + History
Subjects + Integration
75. 2) No hashed business keys
…or surrogate keys
Only concatenated ones
http://www.cannabisculture.com/files/images/6/hashbrick.JPG
76. 3) Fewer joins
Relation + technical validity timeline + relation context
together in one entity
Photo credit: Public Domain