Invited talk at the April 18th-20th Data Science workshop in Islamabad, Pakistan
How provenance may help Data Science. State of the art and open challenges
Data Science Workshop, Islamabad, April 2017
P. Missier
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.;
• a record of the passage of an item through its various owners: chain of custody
Magna Carta (‘the Great Charter’) was agreed between King John and his barons on 15 June 1215.
A PROV provenance graph
[Figure: a PROV graph for a document lifecycle, spanning the remote and recent past. Editing phase: activities drafting, commenting, and editing use and generate the entities draft v1, draft comments, and draft v2 (attributes distribution=internal, status=draft, version=0.1), linked by used, wasGeneratedBy, and wasDerivedFrom edges; a reading activity used paper3. Agents Bob (with specializations Bob-1 and Bob-2) and Alice (type=person) carry roles including main_editor, jr_editor, author, and editor, and are linked via wasAssociatedWith, wasAttributedTo, and actedOnBehalfOf. Publishing phase: a guideline update activity used draft v2 and working draft WD1 to generate pub guidelines v2 (distribution=public, status=draft, version=1.0), derived from pub guidelines v1; agents Charlie (role=headOfPublication), Alice, and the w3c:consortium (type=institution, role=issuer) are associated with the publication.]
The W3C Working Group on Provenance
2009–2010: W3C Incubator Group on provenance (chair: Yolanda Gil, ISI, USC).
Main output: the “Provenance XG Final Report” (http://www.w3.org/2005/Incubator/prov/XGR-prov/), which
- provides an overview of the various existing approaches and vocabularies
- proposes the creation of a dedicated W3C Working Group

April 2011: W3C Working Group approved (chairs: Luc Moreau, Paul Groth).

April 2013: Proposed Recommendations finalised:
- prov-dm: data model
- prov-o: OWL ontology, RDF encoding
- prov-n: PROV notation
- prov-constraints
…plus a number of non-prescriptive Notes.
http://www.w3.org/2011/prov/wiki/
PROV: scope and structure
source: http://www.w3.org/TR/prov-overview/
Recommendation
track
See also:
Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures on the
Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129.
doi:10.2200/S00528ED1V01Y201308WBE007.
Same example — PROV-O notation
:draftComments a prov:Entity ;
    :distr "internal"^^xsd:string ;
    prov:wasGeneratedBy :commenting .

:commenting a prov:Activity ;
    prov:used :draftV1 .

:draftV1 a prov:Entity ;
    :distr "internal"^^xsd:string ;
    :status "draft"^^xsd:string ;
    :version "0.1"^^xsd:string ;
    prov:wasGeneratedBy :drafting .

:drafting a prov:Activity ;
    prov:used :paper1, :paper2 .

:paper1 a prov:Entity ;
    :type "reference"^^xsd:string .

:paper2 a prov:Entity ;
    :type "reference"^^xsd:string .
(RDF / Turtle notation)
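To make the structure concrete, the same statements can be handled as plain (subject, predicate, object) triples and queried directly; a minimal sketch, with no RDF library and names taken from the example above:

```python
# Minimal sketch (no RDF library): the Turtle statements above as plain
# (subject, predicate, object) triples, queried with set comprehensions.
triples = {
    (":draftComments", "a", "prov:Entity"),
    (":draftComments", "prov:wasGeneratedBy", ":commenting"),
    (":commenting", "a", "prov:Activity"),
    (":commenting", "prov:used", ":draftV1"),
    (":draftV1", "a", "prov:Entity"),
    (":draftV1", "prov:wasGeneratedBy", ":drafting"),
    (":drafting", "a", "prov:Activity"),
    (":drafting", "prov:used", ":paper1"),
    (":drafting", "prov:used", ":paper2"),
}

def objects(s, p):
    """All o such that the triple (s, p, o) is asserted."""
    return {o for (s2, p2, o) in triples if s2 == s and p2 == p}

# Which activity generated the comments, and what did that activity use?
print(objects(":draftComments", "prov:wasGeneratedBy"))  # {':commenting'}
print(objects(":commenting", "prov:used"))               # {':draftV1'}
```

In practice the same queries would be posed in SPARQL over the RDF encoding; the triple-set view is only meant to show how little machinery the graph model needs.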
Association, Attribution, Delegation: who did what?
An activity association is an assignment of responsibility to an agent for an activity,
indicating that the agent had a role in the activity.
Attribution is the ascribing of an entity to an agent.
entity(ex:draftComments, [ex:distr="internal"])
activity(ex:commenting)
agent(ex:Bob, [prov:type="mainEditor"])
agent(ex:Alice, [prov:type="srEditor"])
wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role="editor"])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)
Same example — PROV-O notation (RDF/N3)
:Alice a prov:Agent, ex:chiefEditor ;
    :firstName "Alice" ;
    :lastName "Cooper" .

:Bob a prov:Agent, ex:seniorEditor ;
    :firstName "Robert" ;
    :lastName "Thompson" ;
    prov:actedOnBehalfOf :Alice .

:draftComments prov:wasAttributedTo :Bob .

:drafting a prov:Activity ;
    prov:wasAssociatedWith :Bob .
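Delegation chains like the one above can be followed programmatically to find who is ultimately responsible for an entity; a toy sketch (the dictionaries are an illustrative encoding, not PROV terms):

```python
# Toy sketch of attribution + delegation: the dictionaries mirror the
# example's wasAttributedTo and actedOnBehalfOf statements.
attributed_to = {"draftComments": "Bob"}
acted_on_behalf_of = {"Bob": "Alice"}   # delegate -> responsible agent

def responsibility_chain(entity):
    """The attributed agent, followed by those they acted on behalf of."""
    agent = attributed_to[entity]
    chain = [agent]
    while agent in acted_on_behalf_of:
        agent = acted_on_behalf_of[agent]
        chain.append(agent)
    return chain

print(responsibility_chain("draftComments"))  # ['Bob', 'Alice']
```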
Derivation amongst entities
A derivation is a transformation of an entity into another, an update of an entity
resulting in a new one, or the construction of a new entity based on a pre-existing
entity.
entity(ex:draftV1)
entity(ex:draftComments)
wasDerivedFrom(ex:draftComments, ex:draftV1)
Q.: what is the relationship between derivation, generation, and usage?
:draftComments a prov:Entity ;
    prov:wasDerivedFrom :draftV1 .

:draftV1 a prov:Entity .
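One way to answer the question: PROV-CONSTRAINTS ties a (precise) derivation to an underlying generation/usage pair, i.e. some activity used the source entity and generated the derived one. A toy check over the example's statements (the tuple encoding is illustrative):

```python
# Toy check of the PROV-CONSTRAINTS reading of derivation: if e2 was
# derived from e1, some activity used e1 and generated e2.
used = {("commenting", "draftV1")}                    # (activity, entity)
was_generated_by = {("draftComments", "commenting")}  # (entity, activity)
was_derived_from = {("draftComments", "draftV1")}     # (target, source)

def derivation_supported(target, source):
    """Is there an activity that used `source` and generated `target`?"""
    activities = {a for (_, a) in was_generated_by}
    return any((target, a) in was_generated_by and (a, source) in used
               for a in activities)

for (e2, e1) in was_derived_from:
    print(e2, "<-", e1, "supported:", derivation_supported(e2, e1))
# draftComments <- draftV1 supported: True
```

Note that a plain wasDerivedFrom statement need not name the activity; the check above applies to the qualified case where usage and generation are also asserted.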
From “scruffy” provenance to “valid” provenance
- Are all possible temporal partial orderings of events equally acceptable?
- How can we specify the set of all valid orderings?
- How do we formally define what it means for a set of provenance statements to be valid?

PROV defines a set of temporal constraints that ensure the consistency of a provenance graph.
Talk Outline
• Provenance, why? (in science)
• Provenance of Scientific Data
• The DataONE Federation of Data Repositories (dataone.org)
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• (Provenance for streaming data analytics)
Why provenance?
• Reproducibility of your own and your peers’ work, e.g. in experimental science
• Communication: to engender trust in the data, and amongst the people and systems that are responsible for it
• Understandability: to explain the outcome of a complex decision process

Provenance in machine learning:
• Why is my predictive algorithm recommending these new friends to me?
• How can I trust my classifier’s predictions?

Example: assessing trust in Web artifacts and crowdsourced annotations [1]

[1] Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., van Hage, W. R., & Nottamkandath, A. (2016). Combining User Reputation and Provenance Analysis for Trust Assessment. J. Data and Information Quality, 7(1–2), 6:1–6:28. http://doi.org/10.1145/2818382
Trusted Web data: Provenance on the Web
Tim Berners-Lee’s “Oh Yeah” button:
http://users.ugent.be/~tdenies/OhYeah/
“Easy Access to Provenance: an Essential Step Towards Trust on the Web.” In Procs. METHOD 2013: The 2nd IEEE International Workshop on Methods for Establishing Trust with Open Data, held in conjunction with COMPSAC (the IEEE Signature Conference on Computers, Software & Applications), July 22–26, 2013, Kyoto, Japan. http://dx.doi.org/10.1109/COMPSACW.2013.29
Understandability: explaining process outcomes
• Which process was used to derive a
diagnosis?
• How did the process use the input
data?
• How were the steps configured?
• Which decisions were made by
human experts (clinicians)?
[Figure: an NGS pipeline for clinical diagnosis of genetic diseases. Variant scoping: user-supplied gene lists, disease keywords, and preferred genes are expanded via HPO-to-OMIM and OMIM-to-gene matching, and gene union/intersection determines the genes in scope, from which the variants in scope are selected. Variant filtering: MAF threshold; non-synonymous, stop/gain, and frameshift variants; known polymorphisms; homo/heterozygosity; pathogenicity predictors. Variant classification: candidate variants are looked up in ClinVar and OMIM and labelled RED (found, pathogenic), AMBER (not found, or uncertain), or GREEN (found, benign), yielding annotated patient variants.]
Example provenance query: “Find all invocations that used a specific version of ClinVar and OMIM, and group them by phenotype.”
Reproducibility and dissemination in Science
Experimental science is data-intensive
Independent validation of result claims is a cornerstone of scientific discourse
Provenance is the equivalent of a formal logbook
• Capture all steps involved in the derivation of a
result
• How much detail?
• Replay, validate, compare
Re-what?
Re-*:
• Repeat: same experiment, same setup, same lab (P, D, dep, env(dep))
• Replicate: same experiment and setup, different lab (P, D, dep, env’(dep))
• Rerun: vary the experiment and setup, same lab (P becomes P’, D becomes D’, dep becomes dep’)
• Reproduce: vary the experiment and setup, different lab (P’, D’, dep’, env(dep) becomes env’(dep’))
• Reuse: a different experiment (same data D, a new process Q in place of P)
Lifecycle with tools annotations
[Diagram: the experiment lifecycle (compute, package, publish, search, discover, compare, deploy), with D, P, dep, Env evolving into D’, P’, dep’, and prov(D), prov(D’), spec(P), spec(P’) linking the stages. The stages are annotated with supporting tools:]
• packaging and publishing: Research Objects; DataONE federated research data repositories
• deployment and virtualisation: TOSCA-based virtualisation; ReproZip
• provenance capture: workflow provenance; YesWorkflow; noWorkflow; a Matlab provenance recorder (DataONE)
• comparison: PDIFF, for differencing provenance traces
References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et
al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011).
doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati,
Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.”
In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014.
doi:10.2218/ijdc.v9i2.332.
Process Virtualisation using TOSCA
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using
TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015.
doi:10.1109/CLOUD.2015.146.
NoWorkflow: provenance recording for Python
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire.
“noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne,
Germany: Springer, 2014.
YesWorkflow: Qian Zhang, Yang Cao, Qiwen Wang, Duc Vu, Priyaa Thavasimani, Timothy
McPhillips, Paolo Missier, Bertram Ludäscher, Revealing the Detailed Lineage of Script Outputs using
Hybrid Provenance, Procs. International Data Curation Conference, DCC, Edinburgh, 2017.
PDIFF: comparing provenance traces

A graph obtained as the result of a “diff” of two traces can be used to explain observed differences in workflow outputs, in terms of differences throughout the two executions.

[Figure: (i) Trace A and (ii) Trace B, two executions of the same workflow (steps S0–S4 over data items d1, d2, d3, x, y, z, w, producing final output df), and (iii) the resulting delta tree pairing the diverging items (dF, y, w, d2). This is the simplest possible delta “graph”.]
Two executions of the same workflow, with slight differences:
• unintentional changes (e.g., incorrect porting or re-deployment): cause analysis
• intentional changes (e.g., different parameter settings): impact analysis
Missier, P., Woodman, S., Hiden, H., & Watson, P. (2013). Provenance and data differencing for
workflow reproducibility analysis. Concurrency and Computation: Practice and Experience, 28(4),
995–1015. http://doi.org/10.1002/cpe.3035
DataONE cyberinfrastructure: components for a flexible, scalable, sustainable network
www.dataone.org/member-nodes
Coordinating Nodes
• retain complete
metadata catalog
• indexing for search
• network-wide services
• ensure content
availability
(preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for
managing their data
• retain copies of data
ProvONE: extending PROV with process structure
https://purl.dataone.org/provone-v1-dev
Yang Cao, Christopher Jones, Víctor Cuevas-Vicenttín, Matthew B. Jones, Bertram Ludäscher, Timothy McPhillips, Paolo Missier, Christopher Schwalm, Peter Slaughter, Dave Vieglais, Lauren Walker, and Yaxing Wei. “ProvONE: Extending PROV to Support the DataONE Scientific Community.” TAPP Workshop on Theory and Practice of Provenance, 2016.
[Diagram: the ProvONE model, covering workflow structure and retrospective provenance.]
Database provenance
• Why is record R included in the result of a query? [why-provenance] (*)
• Why is record R not in the result? [why-not provenance] (+)

A typical question (from the SIGMOD 2007 tutorial): for a given database query Q, a database D, and a tuple t in the output of Q(D), which parts of D “contribute” to t?

R:
Emp   Dept
John  D01
Susan D02
Anna  D04

S:
Did  Mgr
D01  Mary
D02  Ken
D03  Ed

Q = select r.Emp, r.Dept, s.Mgr
    from R r, S s
    where r.Dept = s.Did

Q(D):
Emp   Dept  Mgr
John  D01   Mary
Susan D02   Ken

The provenance of the tuple (John, D01, Mary) in the output consists of the source facts R(John, D01) and S(D01, Mary), according to the query Q. The same question can also be asked of an attribute value, a table, or any subtree in hierarchical/tree-like data.

(*) Cheney, J., Chiticariu, L., & Tan, W.-C. (2009). Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases, 1, 379–474.
(+) Herschel, M., & Hernández, M. A. (2010). Explaining missing answers to SPJUA queries. Proceedings of the VLDB Endowment, 3(1–2), 185–196. http://doi.org/10.14778/1920841.1920869
Source: Peter Buneman (University of Edinburgh) and Wang-Chiew Tan (UC Santa Cruz), “Provenance in Databases: Past, Current, Future,” SIGMOD Tutorial 2007.
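The why-provenance of the join in this example can be computed by instrumenting the join itself, recording for each output tuple the source facts that produced it; a minimal sketch:

```python
# Sketch: why-provenance for the join example. For each output tuple we
# record the pair of source facts that produced it.
R = [("John", "D01"), ("Susan", "D02"), ("Anna", "D04")]
S = [("D01", "Mary"), ("D02", "Ken"), ("D03", "Ed")]

def join_with_provenance(R, S):
    """Q(D) plus, per output tuple, the contributing source facts."""
    out = {}
    for (emp, dept) in R:
        for (did, mgr) in S:
            if dept == did:
                out[(emp, dept, mgr)] = {("R", emp, dept), ("S", did, mgr)}
    return out

prov = join_with_provenance(R, S)
print(sorted(prov[("John", "D01", "Mary")]))
# [('R', 'John', 'D01'), ('S', 'D01', 'Mary')]
```

Real systems derive the same information from the query plan rather than by rewriting every operator by hand, but the principle is the same.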
Provenance from analytics
Analytics data processing generates potentially valuable knowledge. But credibility of the outcomes requires:
- understandability of the data processing (human-oriented: “prospective provenance”)
- provenance recording and query (machine-oriented: PROV in XML or RDF; Neo4j data and query models)

Problem:
- process complexity and lack of transparency
- provenance is either too coarse-grained or too complex to understand
Some research prototypes
Titian (*)
Apache Spark maintains the program
transformation lineage to recover from
failures
• Titian enhances the Spark RDD
programming model
• data provenance capture
• interactive query support
(*) Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Condie, T. (2015). Titian: Data Provenance
Support in Spark. Proc. VLDB Endow., 9(3), 216–227. http://doi.org/10.14778/2850583.2850595
Lipstick on a pig (+)(++)
A framework that marries database-style and workflow provenance models
The catch… all modules must be
implemented in Pig Latin
(+) Amsterdamer, Y., Davidson, S. B., Deutch, D., Milo, T., Stoyanovich, J., & Tannen, V. (2011). Putting lipstick on
pig: enabling database-style workflow provenance. Proc. VLDB Endow., 5(4), 346–357.
http://dl.acm.org/citation.cfm?id=2095686.2095693
(++) Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (2008). Pig latin: a not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data
(pp. 1099–1110). New York, NY, USA: ACM. http://doi.acm.org/10.1145/1376616.1376726
Provenance for analytics: Titian
Apache Spark natively maintains the program transformation lineage so that it can
reconstruct lost RDD partitions in the case of a failure
• Titian enhances it with data provenance capture and interactive query support
that extends the Spark RDD programming model
• With limited overhead of less than 30%
[1] Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Condie, T. (2015). Titian: Data Provenance
Support in Spark. Proc. VLDB Endow., 9(3), 216–227. http://doi.org/10.14778/2850583.2850595
[Figures: LineageRDD methods for traversing the data lineage in both backward and forward directions [1]; the job workflow after adding the lineage capture points.]
Provenance for analytics: “Lipstick on a pig”
[2] Amsterdamer, Y., Davidson, S. B., Deutch, D., Milo, T., Stoyanovich, J., & Tannen, V. (2011). Putting lipstick on
pig: enabling database-style workflow provenance. Proc. VLDB Endow., 5(4), 346–357.
http://dl.acm.org/citation.cfm?id=2095686.2095693
[3] Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (2008). Pig latin: a not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data
(pp. 1099–1110). New York, NY, USA: ACM. http://doi.acm.org/10.1145/1376616.1376726
“A framework that marries database-style and workflow provenance models
capturing internal state as well as fine-grained dependencies in workflow provenance”
The catch… all modules must be
implemented in Pig Latin:
“an emerging language that
combines high-level declarative
querying with low-level procedural
programming and parallelization in
the style of map-reduce” [3]
Provenance for analytics: Map-Reduce
[4] Ikeda, R., Park, H., & Widom, J. (2011). Provenance for generalized map and reduce workflows. In: CIDR 2011.
Scope: generalized map and reduce workflows (GMRWs), in which input data sets are processed by an acyclic graph of map and reduce functions.
Transformations: Map, Reduce, Union, Split; each transformation has an associated provenance operator.
RAMP (Reduce And Map Provenance) is an extension to Hadoop that transparently wraps each function with provenance capture.
Provenance for analytics: Map-Reduce
[5] Crawl, D., Wang, J., & Altintas, I. (2011). Provenance for MapReduce-based data-intensive workflows. In
Proceedings of the 6th workshop on Workflows in support of large-scale science (pp. 21–30).
The Kepler+Hadoop framework works for Kepler workflows that invoke Hadoop jobs, and captures provenance inside the MapReduce job.
Provenance for analytics: Map-Reduce
[6] Murray, D. G., McSherry, F., Isard, M., Isaacs, R., Barham, P., & Abadi, M. (2016). Incremental, iterative data
processing with timely dataflow. Communications of the ACM, 59(10), 75–83. http://doi.org/10.1145/2983551
Extend the MapReduce framework with change propagation:
• the framework keeps track of the dependencies between subsets of each MapReduce computation
• when a subset of the input changes, it rebuilds only the parts of the computation and of the output affected by the changes
Challenges
• Too little, or too much, provenance
• Not at the right level of abstraction
  • complex, black-box analytics: coarse-grained provenance
  • white-box analytics frameworks: fine-grained provenance, often too detailed (we need abstraction / view mechanisms!)
  • ad hoc analytics (e.g., Python, R): ??
We need flexible, user-level provenance capture from complex analytics.
Additional recent research on Provenance and Big Data
Chen, Peng, and Beth A. Plale. “Big Data Provenance Analysis and Visualization.” In Procs. 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2015), pp. 797–800, May 2015. doi:10.1109/CCGrid.2015.85
Chen, Peng, and Beth A. Plale. “ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model.” In Procs. 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2015), pp. 525–534, May 2015. doi:10.1109/CCGrid.2015.86
Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs
Peter Macko and Margo Seltzer, Harvard University, Procs. TAPP’11, 2011, Crete, Greece
Provenance from Log Files: a BigData Problem, Devarshi Ghoshal and
Beth Plale, Procs. BigProv workshop, EDBT, Genova, Italy, 2013
Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory
Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop,
Edinburgh, 2015
http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf
Goal: minimise re-computation effort across all prior outcomes
Objective:
• Reduce the amount of computation performed in reaction to changes
Constraint: selective re-computation should be lossless
• All instances that may be subject to impact must be considered
• P1: Partial re-execution
• Reduce re-computation to only those parts of a process that are actually
involved in the processing of the changed data
• P2: Differential execution
• Insight: if an instance I of process P is executed on the delta between two versions of its inputs and produces an empty result, then I is not affected by the version change
• Only feasible if certain algebraic properties of the process hold
• P3: Identifying the scope of change
• Determine which instances out of a population of outcomes are going to be
affected by the change
P1: Partial re-execution
Objective: find the minimal sub-graph of a workflow that is affected by a change
Approach:
• e-SC generates one ProvONE provenance trace for each workflow run
• Use traces to identify the minimal sub-workflow that is affected by the change
Initial query, given a new version d’ of a reference database d (variable I denotes one execution instance, i.e. a workflow invocation): find all activities A within invocation I that used a prior version Dep of d’:

:- invocation(I), wasPartOf(A,I),
   wasDerivedFrom(d’,Dep), used(A,Dep)

Traversal queries: collect all other activities connected to A.

Implicit data transfer:
:- execution(A1), execution(A2),
   wasPartOf(A,I), wasInformedBy(A2,A1)

Explicit data transfer (Data from A1 to A2):
:- execution(A1), execution(A2),
   wasGeneratedBy(Data,A1), used(A2,Data)
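The effect of these queries can be sketched as a forward traversal over the trace; the dependency maps below are illustrative stand-ins for a ProvONE trace:

```python
from collections import deque

# Illustrative stand-in for a trace: `used` maps each activity to the
# data it read; `generated` maps each data item to its producer.
# `d` is the reference data that changed.
used = {"A1": {"d", "x"}, "A2": {"y1"}, "A3": {"x"}}
generated = {"y1": "A1", "y2": "A2", "y3": "A3"}

def affected_by(changed_data):
    """Activities that directly or transitively consumed the changed data."""
    frontier = deque(a for a, ds in used.items() if changed_data in ds)
    affected = set()
    while frontier:
        a = frontier.popleft()
        if a in affected:
            continue
        affected.add(a)
        # follow explicit data transfer: a's outputs consumed downstream
        outputs = {d for d, producer in generated.items() if producer == a}
        frontier.extend(a2 for a2, ds in used.items() if ds & outputs)
    return affected

print(sorted(affected_by("d")))  # ['A1', 'A2']: A3 never saw the change
```

The affected set delimits the minimal sub-workflow to re-execute; everything else can reuse its previously stored outputs.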
Results
• How much can we save? It depends on:
  • the process structure
  • the point of first usage of the reference data
• Overhead: storing the interim data required for partial re-execution
  • 156 MB for GeneMap changes and 37 kB for ClinVar changes

Time savings:
            Partial re-execution (sec)   Complete re-execution (sec)   Time saving (%)
GeneMap     325                          455                           28.5
ClinVar     287                          455                           37
P2: Differential execution
Suppose D is a relation (a table), and diffD() computes the difference between two versions of D. The idea is to compute P’s output on the new version as a combination of the previously computed output and P applied to the difference. This is effective if the difference is small relative to D, and it can be achieved when the operators that make up P satisfy certain algebraic properties(*).

(*) associative, and distributive over set union and difference

[The definitions of diffD() and the recombination are given as equations in the original slide figure.]
P2: Partial re-computation using input difference
Insight: run SVI, but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) becomes Q(diff(CV1, CV2)).
This works for SVI but is hard to generalise: it depends on the type of process.
Gain: diff(CV1, CV2) is much smaller than CV2.

GeneMap versions (from –> to)   ToVersion rec. count   Difference rec. count   Reduction
16-03-08 –> 16-06-07            15910                  1458                    91%
16-03-08 –> 16-04-28            15871                  1386                    91%
16-04-28 –> 16-06-01            15897                  78                      99.5%
16-06-01 –> 16-06-02            15897                  2                       99.99%
16-06-02 –> 16-06-07            15910                  33                      99.8%

ClinVar versions (from –> to)   ToVersion rec. count   Difference rec. count   Reduction
15-02 –> 16-05                  290815                 38216                   87%
15-02 –> 16-02                  285042                 35550                   88%
16-02 –> 16-05                  290815                 3322                    98.9%
P3: Identifying the scope of change: a game of battleship
[Figure: a patient / change impact matrix, with X marking the combinations actually affected by a change.]

Challenge: precisely identify the scope of a change.
Blind reaction to change: recompute the entire matrix.
Can we do better? Hit the high-impact cases (the X marks) without re-computing the entire matrix.
A scoping algorithm
Coarse-grained provenance indicates whether or not a dependency on D existed, but not which specific data from version Dt of D was used.

Candidate invocation: any invocation I of P whose provenance contains statements of the form
  used(A,Dt), wasPartOf(A,I), wasAssociatedWith(I,_,P)
or
  used(I,Dt), wasAssociatedWith(I,_,P)

Sketch of the algorithm (simplified):
- For each candidate invocation I of P:
  - compute the minimal subgraph P’ of P that needs re-computation
  - partially re-execute P’ using the difference sets as inputs, one step at a time, until <empty output> or <P’ completed>
  - if <P’ completed>, execute P’ on the full inputs
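A minimal executable sketch of that loop, with placeholder step functions standing in for the reduced sub-workflow P’:

```python
def in_scope(steps, diff_input):
    """Run P' one step at a time on the difference set; an empty
    intermediate output proves the invocation is unaffected."""
    data = diff_input
    for step in steps:
        data = step(data)
        if not data:
            return False          # <empty output>: out of scope
    return True                   # <P' completed>: re-run on full inputs

def recompute_scope(candidates, steps, diff_input):
    """Candidate invocations that must be re-executed on the full inputs."""
    return [inv for inv in candidates if in_scope(steps, diff_input)]

# Placeholder steps: one passes data through, the other filters it all out.
def keep_all(xs):
    return xs

def drop_all(xs):
    return []

print(recompute_scope(["inv1", "inv2"], [keep_all], [1, 2]))            # ['inv1', 'inv2']
print(recompute_scope(["inv1", "inv2"], [keep_all, drop_all], [1, 2]))  # []
```

In the real algorithm each invocation has its own provenance-derived P’ and difference set; the sketch shares them across candidates only for brevity.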
Scoping: efficiency
Total re-computation time for the whole patient cohort:
[Chart: execution time (hours, 0–5) against ClinVar update date (mm/yy), comparing four strategies: CV blind; CV selective process; CV selective scope, δ-gen; CV selective scope, δ-SVI.]
We expect to pay a penalty for running the algorithm when the difference sets are
large compared to actual new data
More accurate diff functions result in higher runtime savings
• Smaller difference sets
• More precise scoping
Provenance in ReComp: Summary and Challenges
Objective:
• Reduce the amount of computation performed in reaction to changes
1. Partial re-execution of previously computed workflows
2. (Differential execution)
3. Identifying the scope of change
• Makes use of (2) to determine which instances of a population of outcomes are
going to be affected by a change
Challenges / work in progress:
• Validate and extend the approach to other case studies
• Design estimators to predict the impact of change
• Design and implement a generic ReComp meta-process
• Observe P in execution
• Detect changes
• Selectively react to changes
Talk Outline
• Provenance, why? (in science)
• Provenance, of what?
• Of Scientific Data: The DataONE Federation of Data Repositories (dataone.org)
• Of database data (very briefly)
• Of Web data: the W3C PROV data model for provenance
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• Provenance for streaming data analytics
Process-specific provenance using templates
Aim: enable provenance generation from black-box analytics.

[Diagram: a process definition P: X –> Y is paired with a provenance template T and a sidecar binding process PB: <xi, yi, P> –> B. For each execution xi –> Pi –> yi, PB produces a binding B, and Apply(B,T) yields a PROV document.]

Approach:
1) human-oriented: “prospective provenance” (YesWorkflow)
2) machine-oriented: PROV (XML, RDF, Neo4j data and query models)

The resulting provenance is:
• application-level
• user-specified
• coarse-grained
Illustration: the case of map()
[y1 … yn] = map(λx. f(x), [x1 … xn])

Process definition: map: <X, lambda x: f(x)> –> Y

Provenance template T: used(:a, :x), wasGeneratedBy(:y, :a), with activity :a associated with plan f.

Execution: [y1 … yn] = map(lambda x: f(x), [x1 … xn])

Binding produced by the sidecar process PB:
B = { <:x ← x1, :y ← y1, :a ← map1>,
      …
      <:x ← xn, :y ← yn, :a ← mapn> }

Apply(B,T) yields, for each i: used(mapi, xi) and wasGeneratedBy(yi, mapi), with each mapi associated with plan f.
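The sidecar idea for map() can be emulated in a few lines; the tuple encoding of PROV statements below is illustrative, not an official serialization:

```python
# TEMPLATE mirrors the slide's template T: used(:a, :x), gen(:y, :a).
TEMPLATE = [("used", ":a", ":x"), ("wasGeneratedBy", ":y", ":a")]

def map_with_provenance(f, xs):
    """map() plus the sidecar bindings: one binding per element."""
    ys, bindings = [], []
    for i, x in enumerate(xs, start=1):
        ys.append(f(x))
        bindings.append({":x": f"x{i}", ":y": f"y{i}", ":a": f"map{i}"})
    return ys, bindings

def apply_template(template, binding):
    """Apply(B, T): substitute the binding's values into the template."""
    return [(p, binding[s], binding[o]) for (p, s, o) in template]

ys, bindings = map_with_provenance(lambda x: x * x, [2, 3])
print(ys)                                   # [4, 9]
print(apply_template(TEMPLATE, bindings[0]))
# [('used', 'map1', 'x1'), ('wasGeneratedBy', 'y1', 'map1')]
```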
Application of template approach to streaming analytics
Data in movement is a prime source for value-added analytics applications, e.g. data streams from Internet of Things devices.
The provenance of an output data stream is a stream of provenance statements, and the template / sidecar process / binding framework applies without changes.
Following the Spark Streaming model:
• a stream is discretised into a sequence of micro-batch intervals
• a window W = [x1 … xk] is a user-defined sequence of intervals xi
• a process P: Wi –> Y operates on one window at a time (Y may be a sequence or another multivariate data structure)
Provenance streams
A simple template, “each :y is generated using the content of one window”: used(:a, :w), wasGeneratedBy(:y, :a).

Sidecar invocations B1P, B2P, … run alongside each invocation P1, P2, … of P:
  wi = [xi1 … xin]
  y1 = P1(w1),  B1 = B1P(w1, P1, y1),  Prov1 = apply(B1, T)
  y2 = P2(w2),  B2 = B2P(w2, P2, y2),  Prov2 = apply(B2, T)
  …

Each BiP produces a binding Bi, and applying Bi to the template produces a PROV document Provi. The result is a stream of provenance alongside the stream of outputs yi.
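This scheme can be sketched as a generator that yields a PROV-style document alongside each window output, assuming the discretised-stream model above (the tuple encoding of statements is illustrative):

```python
# Template for the streaming case: used(:a, :w), wasGeneratedBy(:y, :a).
TEMPLATE = [("used", ":a", ":w"), ("wasGeneratedBy", ":y", ":a")]

def apply_template(template, binding):
    return [(p, binding[s], binding[o]) for (p, s, o) in template]

def windows(stream, size):
    """Discretise a (finite, for this sketch) stream into windows."""
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

def process_with_provenance(P, stream, size):
    """Yield each window output together with its provenance document."""
    for i, w in enumerate(windows(stream, size), start=1):
        y = P(w)
        binding = {":w": f"w{i}", ":a": f"P{i}", ":y": f"y{i}"}
        yield y, apply_template(TEMPLATE, binding)

out = list(process_with_provenance(sum, [1, 2, 3, 4], size=2))
print(out[0])  # (3, [('used', 'P1', 'w1'), ('wasGeneratedBy', 'y1', 'P1')])
```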
Extensions
The framework applies to a stateful process P:
• P’s outcome depends on an internal state S
• P’s execution may modify S
Hint: S is itself an entity with provenance, defined by its own template and bindings.

Add flexibility by allowing multiple templates and multiple sidecar processes for the same process execution.

The actual provenance output at the end of the process, Prov = apply(B,T), is an extensional representation; <T, B> is an intensional representation, which is
• more space-efficient (T is stored only once; B is a set of variable-value pairs)
• only serialised when needed
Summary
• Provenance, why? (in science)
• Provenance, of Scientific Data: The DataONE Federation of Data
Repositories (dataone.org)
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• (Provenance for streaming data analytics)
Selected bibliography
Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell,
et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012.
http://www.w3.org/TR/prov-dm/
Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012.
http://www.w3.org/TR/prov-constraints/
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web
Semantics: Science, Services and Agents on the World Wide Web (April 2015).
doi:10.1016/j.websem.2015.04.001.
http://www.sciencedirect.com/science/article/pii/S1570826815000177
Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz,
Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific
Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530.
http://dx.doi.org/10.1002/cpe.1870.
Firth, H., and P. Missier. "ProvGen: Generating Synthetic PROV Graphs with Predictable Structure." In Procs. IPAW 2014 (Provenance and Annotations), Köln, Germany: Springer, 2014.
http://arxiv.org/pdf/1406.2495
Missier, P., J. Bryans, C. Gamble, V. Curcin, and R. Danger. "ProvAbs: Model, Policy, and Tooling for Abstracting PROV Graphs." In Procs. IPAW 2014 (Provenance and Annotations), Köln, Germany: Springer, 2014.
http://arxiv.org/pdf/1406.1998
De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance
Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh,
Scotland: USENIX Association, 2015. https://www.usenix.org/conference/tapp15/workshop-
program/presentation/de-oliveira.
Editor's Notes
W3C Recommendation (REC)
A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations.
remark on PROV-AQ: nothing to do with querying, but a query model can be associated to each of the encodings
Working Group Note
A Working Group Note is published by a chartered Working Group to indicate that work has ended on a particular topic. A Working Group may publish a Working Group Note with or without its prior publication as a Working Draft.
baseline-noAgents.provn
baseline-noAgents-unqual.n3
agents are software, organization, person -- non-normative
distinguish between normative and non-normative parts of the PROV documents
Examples of association between an activity and an agent are:
creation of a web page under the guidance of a designer;
various forms of participation in a panel discussion, including audience member, panelist, or panel chair;
a public event, sponsored by a company, and hosted by a museum;
A browser button by which the user can express their uncertainty about a document being displayed “so how do I know I can trust this information?”.
Upon activation of the button, the software then retrieves metadata about the document, listing assumptions on which trust can be based.
hdb_store_2:used(_, CV1Import, CV1, _, _), hdb_store_2:wasPartOf(CV1Import, WI1),
%hdb_store_2:execution(WI1, WIst, WIet, WIAttrs), parse_time(WIst, WIStartTS), date_time_stamp(date(2016, 12, 8), Today), WIStartTS @>= Today,
%hdb_store_2:execution(WI1, WIst, WIet, WIAttrs), split_string(WI1, "/", "", WITokens), last(WITokens, WInvId), number_string(WInvIdNo, WInvId), WInvIdNo > 55900,
%sub_string(CV1Attrs.get('prov:label'), 0, _, _, "variant_summary-"), sub_string(CV1Attrs.get('prov:label'), _, _, 0, "Del.csv"), hdb_store_2:used(_, CV1Import, CV1, _, _), hdb_store_2:wasPartOf(CV1Import, WI1),
hdb_store_2:wasPartOf(GM1Import, WI1), hdb_store_2:used(_, GM1Import, GM1, _, _), hdb_store_2:document(GM1, GM1Attrs),
split_string(GM1, "/", "", GM1Tokens), append(_, [GM1DocId, _], GM1Tokens), % append is to take the second to last element of the list
member(GM1DocId,
hdb_store_2:wasPartOf(Out1Export, WI1), hdb_store_2:wasGeneratedBy(_, Out1, Out1Export, _, _), hdb_store_2:document(Out1, Out1Attrs), Out1Attrs.get('prov:label') == "svi-classification.csv",
hdb_store_2:wasPartOf(PV1Import, WI1), PV1Import \= CV1Import, PV1Import \= GM1Import, hdb_store_2:used(_, PV1Import, PV1, _, _), hdb_store_2:document(PV1, PV1Attrs),
To what extent can these be formalised and automated?
Need to update with new / upcoming MN locations and logos
Amber notes: Retain CN, MN logo? Required if used elsewhere; if not, cut? Not all MN logos will fit – select representative or cut? Cross-reference with Google MN
Rebecca:
Need updated logos for KNB, AOOS (FIXED) – I would select a different set of MNs to highlight since all won’t fit
Rebecca:
Can we do a better job than the quad chart? If not, are all the logos in the 1st quadrant appropriate?
Update before RSV
Figure shows from 2020 – edit?
Data in movement (the Velocity dimension of Big Data) is gaining prominence as a prime source of data for analytics applications. Data streams generated by Internet of Things devices, for instance, are a rich source of implicit signals about the habits of the individuals who operate those devices, e.g. in their smart homes, smart cars, through wearables, etc.
The native Spark compute method is used to plug a LineageRDD instance into the Spark dataflow (described in Section 4).
Firstly, if we can analyse the structure and semantics of process P, then to recompute an instance of P more effectively we may be able to reduce re-computation to only those parts of the process that are actually involved in processing the changed data. For this, we are inspired by techniques for smart rerun of workflow-based applications [6, 7], as well as by more general approaches to incremental computation [8, 9].
(note that we need to restrict the query to the specific invocation I, in case other workflows have used the output of I)
Also, as in Tab. 2 and 3 in the paper, I'd mention whether this reduction was possible with a generic diff function or with a specific function tailored to SVI.
What is also interesting, and what I would highlight, is that even if the reduction is very close to 100% but below it, the cost of recomputing the process may still be significant because of constant-time overheads related to running a process (e.g. loading data into memory). e-SC workflows suffer from exactly this issue (every block serializes and deserializes data), and that is why Fig. 6 shows an increase in runtime for GeneMap executed with 2 deltas even though the reduction is 99.94% (cf. Tab. 2 and Fig. 6 for the GeneMap diff between 16-10-30 → 16-10-31).
Regarding the algorithm, you show the simplified version (Alg. 1). But please also take a look at Alg. 2 and mention that you can only run the loop if distributivity holds for all P in the downstream graph. Otherwise, you need to break and re-execute on the full inputs as soon as the first non-distributive task produces a non-empty output. But, obviously, the hope is that with a well-tailored diff function the output will be empty in the majority of cases.
This figure emphasizes the penalty for running the algorithm when the difference sets were large compared to the actual new data. But it also highlights the importance of the diff and impact functions. Clearly, the more accurate the functions are, the higher the runtime savings may be, which stems from two facts. Firstly, a more accurate diff function tends to produce smaller difference sets, which reduces the time of task re-execution (cf. the CV-diff and CV-SVI-diff lines in Fig. 7). Secondly, a more accurate impact function tends to return false more frequently, and so the algorithm can more often avoid re-computation with the complete new version of the data (cf. the number of black squares vs. the total number of patients affected by a change in Tab. 4).
Differential execution: if I can execute P again using the changes between the old and new input, and the result is empty, then I may conclude that P did not use any of the elements in the difference set.
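The idea can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: `diff`, `propagate`, and `keep_even` are hypothetical names, the diff is naive set difference, and real tasks would need task-specific diff and impact functions as discussed above.

```python
# Differential execution sketch: re-run each downstream task on the
# difference set only. An empty result means the change does not
# propagate further. The loop is only sound while tasks distribute over
# the diff; at the first non-distributive task we must fall back to
# re-execution on the full inputs.

def diff(old, new):
    # Naive difference set: elements added or removed between versions.
    return (set(new) - set(old)) | (set(old) - set(new))

def propagate(tasks, old_input, new_input):
    delta = diff(old_input, new_input)
    for task, distributive in tasks:
        if not delta:
            break                  # empty diff: downstream tasks unaffected
        if not distributive:
            return f"full re-execution from {task.__name__}"
        delta = task(delta)        # re-run on the difference set only
    return "affected" if delta else "outputs unchanged"

# Toy task that distributes over the diff: a per-element filter.
def keep_even(xs):
    return {x for x in xs if x % 2 == 0}

print(propagate([(keep_even, True)], old_input=[1, 2, 3], new_input=[1, 2, 5]))
```

In this run the changed elements (3 removed, 5 added) are both filtered out by `keep_even`, so the empty difference set lets us conclude that the output is unchanged without full re-execution.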
Map is well defined, white-box, and well understood, so it is a good starting point for appreciating the template-and-binding idea.
The first element is a provenance template.
<show template T1> --> show a graph using provtoolbox
Then we have a binding, along with a binding generation process.
<example B1>
Finally, apply(B, T).
Show the final prov graph using provtoolbox.
Spark Streaming uses a micro-batch architecture, whereby a streaming computation is achieved through a sequence of batch computations, each operating on a fragment of the stream defined by a configurable batch interval. Each micro-batch consists of a finite sequence of data structures (RDDs, Spark's main data abstraction), known as a DStream.
we may view a DStream within each batch as a list of values, like those we have used in our batch examples. Thus, just as it allows Spark Streaming to reuse most of its RDD transformation and action operators, micro-batching also makes our framework reusable for generating provenance over streams. Spark also supports windows as application-friendly abstractions over streams. A window is simply a sequence of contiguous micro-batches, and a breakdown of the stream into windows is specified in the usual way, namely by a combination of window size (the number of micro-batches) and sliding duration. In this setting, the template T is defined so that it applies to each window, and the sidecar process executes once for each window, producing a stream of bindings B1, B2, …. Correspondingly, this triggers a sequence of calls apply(B1, T), apply(B2, T), …, resulting in a stream of provenance statements alongside the corresponding data input and output streams.
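The micro-batch and window model can be sketched in plain Python, without the Spark Streaming API. All names here (`micro_batches`, `windows`) are hypothetical stand-ins: the first mimics the cut of a stream into DStream micro-batches, the second builds windows as contiguous runs of micro-batches from a window size and a slide.

```python
# Plain-Python sketch of the micro-batch / window model (not the actual
# Spark Streaming API).

def micro_batches(stream, batch_interval):
    # Split the stream into fixed-size micro-batches (stand-in for a DStream).
    return [stream[i:i + batch_interval]
            for i in range(0, len(stream), batch_interval)]

def windows(batches, window_size, slide):
    # A window = `window_size` contiguous micro-batches, advancing by `slide`.
    return [batches[i:i + window_size]
            for i in range(0, len(batches) - window_size + 1, slide)]

stream = list(range(12))
batches = micro_batches(stream, batch_interval=2)   # [[0, 1], [2, 3], ...]
wins = windows(batches, window_size=3, slide=1)

# One binding per window; apply(B_i, T) would then instantiate the template.
prov_stream = [{"var:w": [x for b in w for x in b]} for w in wins]
```

With these parameters, 12 stream elements yield 6 micro-batches and 4 overlapping windows, hence a stream of 4 bindings, exactly the B1, B2, … sequence described above.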