Invited talk at the April 18th-20th Data Science workshop in Islamabad, Pakistan
How provenance may help Data Science. State of the art and open challenges
Data Science Workshop, Islamabad, April 2017
P. Missier
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.;
• a record of the passage of an item through its various owners: chain of custody
Magna Carta (‘the Great Charter’) was agreed between King John and his barons on 15 June 1215.
A PROV provenance graph
[Figure: a PROV graph for a document lifecycle, spanning the remote and recent past. Editing phase: activities drafting, commenting, and editing use and generate the entities draft v1, draft comments, and draft v2 (attributes distribution=internal, status=draft, version=0.1), linked by used, wasGeneratedBy, and wasDerivedFrom edges; a reading activity used paper3. Agents Bob (with specializations Bob-1 and Bob-2) and Alice (type=person) carry roles including main_editor, jr_editor, author, and editor, and are linked via wasAssociatedWith, wasAttributedTo, and actedOnBehalfOf. Publishing phase: a guideline update activity used draft v2 and working draft WD1 to generate pub guidelines v2 (distribution=public, status=draft, version=1.0), derived from pub guidelines v1; agents Charlie (role=headOfPublication), Alice, and the w3c:consortium (type=institution, role=issuer) are associated with the publication.]
The W3C Working Group on Provenance
2009–2010: W3C Incubator Group on provenance (chair: Yolanda Gil, ISI, USC).
Main output: the “Provenance XG Final Report” (http://www.w3.org/2005/Incubator/prov/XGR-prov/), which
- provides an overview of the various existing approaches and vocabularies
- proposes the creation of a dedicated W3C Working Group

April 2011: W3C Working Group approved (chairs: Luc Moreau, Paul Groth).

April 2013: Proposed Recommendations finalised:
- prov-dm: data model
- prov-o: OWL ontology, RDF encoding
- prov-n: PROV notation
- prov-constraints
…plus a number of non-prescriptive Notes.
http://www.w3.org/2011/prov/wiki/
PROV: scope and structure
source: http://www.w3.org/TR/prov-overview/
Recommendation
track
See also:
Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures on the
Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129.
doi:10.2200/S00528ED1V01Y201308WBE007.
Same example — PROV-O notation
:draftComments a prov:Entity ;
    :distr "internal"^^xsd:string ;
    prov:wasGeneratedBy :commenting .

:commenting a prov:Activity ;
    prov:used :draftV1 .

:draftV1 a prov:Entity ;
    :distr "internal"^^xsd:string ;
    :status "draft"^^xsd:string ;
    :version "0.1"^^xsd:string ;
    prov:wasGeneratedBy :drafting .

:drafting a prov:Activity ;
    prov:used :paper1, :paper2 .

:paper1 a prov:Entity ;
    :type "reference"^^xsd:string .

:paper2 a prov:Entity ;
    :type "reference"^^xsd:string .
(RDF / Turtle notation)
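To make the structure concrete, the same statements can be handled as plain (subject, predicate, object) triples and queried directly; a minimal sketch, with no RDF library and names taken from the example above:

```python
# Minimal sketch (no RDF library): the Turtle statements above as plain
# (subject, predicate, object) triples, queried with set comprehensions.
triples = {
    (":draftComments", "a", "prov:Entity"),
    (":draftComments", "prov:wasGeneratedBy", ":commenting"),
    (":commenting", "a", "prov:Activity"),
    (":commenting", "prov:used", ":draftV1"),
    (":draftV1", "a", "prov:Entity"),
    (":draftV1", "prov:wasGeneratedBy", ":drafting"),
    (":drafting", "a", "prov:Activity"),
    (":drafting", "prov:used", ":paper1"),
    (":drafting", "prov:used", ":paper2"),
}

def objects(s, p):
    """All o such that the triple (s, p, o) is asserted."""
    return {o for (s2, p2, o) in triples if s2 == s and p2 == p}

# Which activity generated the comments, and what did that activity use?
print(objects(":draftComments", "prov:wasGeneratedBy"))  # {':commenting'}
print(objects(":commenting", "prov:used"))               # {':draftV1'}
```

In practice the same queries would be posed in SPARQL over the RDF encoding; the triple-set view is only meant to show how little machinery the graph model needs.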
Association, Attribution, Delegation: who did what?
An activity association is an assignment of responsibility to an agent for an activity,
indicating that the agent had a role in the activity.
Attribution is the ascribing of an entity to an agent.
entity(ex:draftComments, [ex:distr="internal"])
activity(ex:commenting)
agent(ex:Bob, [prov:type="mainEditor"])
agent(ex:Alice, [prov:type="srEditor"])
wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role="editor"])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)
Same example — PROV-O notation (RDF/N3)
:Alice a prov:Agent, ex:chiefEditor ;
    :firstName "Alice" ;
    :lastName "Cooper" .

:Bob a prov:Agent, ex:seniorEditor ;
    :firstName "Robert" ;
    :lastName "Thompson" ;
    prov:actedOnBehalfOf :Alice .

:draftComments prov:wasAttributedTo :Bob .

:drafting a prov:Activity ;
    prov:wasAssociatedWith :Bob .
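Delegation chains like the one above can be followed programmatically to find who is ultimately responsible for an entity; a toy sketch (the dictionaries are an illustrative encoding, not PROV terms):

```python
# Toy sketch of attribution + delegation: the dictionaries mirror the
# example's wasAttributedTo and actedOnBehalfOf statements.
attributed_to = {"draftComments": "Bob"}
acted_on_behalf_of = {"Bob": "Alice"}   # delegate -> responsible agent

def responsibility_chain(entity):
    """The attributed agent, followed by those they acted on behalf of."""
    agent = attributed_to[entity]
    chain = [agent]
    while agent in acted_on_behalf_of:
        agent = acted_on_behalf_of[agent]
        chain.append(agent)
    return chain

print(responsibility_chain("draftComments"))  # ['Bob', 'Alice']
```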
Derivation amongst entities
A derivation is a transformation of an entity into another, an update of an entity
resulting in a new one, or the construction of a new entity based on a pre-existing
entity.
entity(ex:draftV1)
entity(ex:draftComments)
wasDerivedFrom(ex:draftComments, ex:draftV1)
Q.: what is the relationship between derivation, generation, and usage?
:draftComments a prov:Entity ;
    prov:wasDerivedFrom :draftV1 .

:draftV1 a prov:Entity .
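One way to answer the question: PROV-CONSTRAINTS ties a (precise) derivation to an underlying generation/usage pair, i.e. some activity used the source entity and generated the derived one. A toy check over the example's statements (the tuple encoding is illustrative):

```python
# Toy check of the PROV-CONSTRAINTS reading of derivation: if e2 was
# derived from e1, some activity used e1 and generated e2.
used = {("commenting", "draftV1")}                    # (activity, entity)
was_generated_by = {("draftComments", "commenting")}  # (entity, activity)
was_derived_from = {("draftComments", "draftV1")}     # (target, source)

def derivation_supported(target, source):
    """Is there an activity that used `source` and generated `target`?"""
    activities = {a for (_, a) in was_generated_by}
    return any((target, a) in was_generated_by and (a, source) in used
               for a in activities)

for (e2, e1) in was_derived_from:
    print(e2, "<-", e1, "supported:", derivation_supported(e2, e1))
# draftComments <- draftV1 supported: True
```

Note that a plain wasDerivedFrom statement need not name the activity; the check above applies to the qualified case where usage and generation are also asserted.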
From “scruffy” provenance to “valid” provenance
- Are all possible temporal partial orderings of events equally acceptable?
- How can we specify the set of all valid orderings?
- How do we formally define what it means for a set of provenance statements to be valid?

PROV defines a set of temporal constraints that ensure the consistency of a provenance graph.
Talk Outline
• Provenance, why? (in science)
• Provenance of Scientific Data
• The DataONE Federation of Data Repositories (dataone.org)
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• (Provenance for streaming data analytics)
Why provenance?
• Reproducibility of your own and your peers’ work, e.g. in experimental science
• Communication: to engender trust in the data, and amongst the people and systems that are responsible for it
• Understandability: to explain the outcome of a complex decision process

Provenance in machine learning:
• Why is my predictive algorithm recommending these new friends to me?
• How can I trust my classifier’s predictions?

Example: assessing trust in Web artifacts and crowdsourced annotations [1]

[1] Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., van Hage, W. R., & Nottamkandath, A. (2016). Combining User Reputation and Provenance Analysis for Trust Assessment. J. Data and Information Quality, 7(1–2), 6:1–6:28. http://doi.org/10.1145/2818382
Trusted Web data: Provenance on the Web
Tim Berners-Lee’s “Oh Yeah” button:
http://users.ugent.be/~tdenies/OhYeah/
“Easy Access to Provenance: an Essential Step Towards Trust on the Web.” In Procs. METHOD 2013: The 2nd IEEE International Workshop on Methods for Establishing Trust with Open Data, held in conjunction with COMPSAC (the IEEE Signature Conference on Computers, Software & Applications), July 22–26, 2013, Kyoto, Japan. http://dx.doi.org/10.1109/COMPSACW.2013.29
Understandability: explaining process outcomes
• Which process was used to derive a
diagnosis?
• How did the process use the input
data?
• How were the steps configured?
• Which decisions were made by
human experts (clinicians)?
[Figure: an NGS pipeline for clinical diagnosis of genetic diseases. Variant scoping: user-supplied gene lists, disease keywords, and preferred genes are expanded via HPO-to-OMIM and OMIM-to-gene matching, and gene union/intersection determines the genes in scope, from which the variants in scope are selected. Variant filtering: MAF threshold; non-synonymous, stop/gain, and frameshift variants; known polymorphisms; homo/heterozygosity; pathogenicity predictors. Variant classification: candidate variants are looked up in ClinVar and OMIM and labelled RED (found, pathogenic), AMBER (not found, or uncertain), or GREEN (found, benign), yielding annotated patient variants.]
Example provenance query: “Find all invocations that used a specific version of ClinVar and OMIM, and group them by phenotype.”
Reproducibility and dissemination in Science
Experimental science is data-intensive
Independent validation of result claims is a cornerstone of scientific discourse
Provenance is the equivalent of a formal logbook
• Capture all steps involved in the derivation of a
result
• How much detail?
• Replay, validate, compare
Re-what?
Re-*:
• Repeat: same experiment, same setup, same lab (P, D, dep, env(dep))
• Replicate: same experiment and setup, different lab (P, D, dep, env’(dep))
• Rerun: vary the experiment and setup, same lab (P becomes P’, D becomes D’, dep becomes dep’)
• Reproduce: vary the experiment and setup, different lab (P’, D’, dep’, env(dep) becomes env’(dep’))
• Reuse: a different experiment (same data D, a new process Q in place of P)
Lifecycle with tools annotations
[Diagram: the experiment lifecycle (compute, package, publish, search, discover, compare, deploy), with D, P, dep, Env evolving into D’, P’, dep’, and prov(D), prov(D’), spec(P), spec(P’) linking the stages. The stages are annotated with supporting tools:]
• packaging and publishing: Research Objects; DataONE federated research data repositories
• deployment and virtualisation: TOSCA-based virtualisation; ReproZip
• provenance capture: workflow provenance; YesWorkflow; noWorkflow; a Matlab provenance recorder (DataONE)
• comparison: PDIFF, for differencing provenance traces
References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et
al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011).
doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati,
Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.”
In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014.
doi:10.2218/ijdc.v9i2.332.
Process Virtualisation using TOSCA
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using
TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015.
doi:10.1109/CLOUD.2015.146.
NoWorkflow: provenance recording for Python
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire.
“noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne,
Germany: Springer, 2014.
YesWorkflow: Qian Zhang, Yang Cao, Qiwen Wang, Duc Vu, Priyaa Thavasimani, Timothy
McPhillips, Paolo Missier, Bertram Ludäscher, Revealing the Detailed Lineage of Script Outputs using
Hybrid Provenance, Procs. International Data Curation Conference, DCC, Edinburgh, 2017.
PDIFF: comparing provenance traces

A graph obtained as the result of a “diff” of two traces can be used to explain observed differences in workflow outputs, in terms of differences throughout the two executions.

[Figure: (i) Trace A and (ii) Trace B, two executions of the same workflow (steps S0–S4 over data items d1, d2, d3, x, y, z, w, producing final output df), and (iii) the resulting delta tree pairing the diverging items (dF, y, w, d2). This is the simplest possible delta “graph”.]
Two executions of the same workflow, with slight differences:
• unintentional changes (e.g., incorrect porting or re-deployment): cause analysis
• intentional changes (e.g., different parameter settings): impact analysis
Missier, P., Woodman, S., Hiden, H., & Watson, P. (2013). Provenance and data differencing for
workflow reproducibility analysis. Concurrency and Computation: Practice and Experience, 28(4),
995–1015. http://doi.org/10.1002/cpe.3035
DataONE cyberinfrastructure: components for a flexible, scalable, sustainable network
www.dataone.org/member-nodes
Coordinating Nodes
• retain complete
metadata catalog
• indexing for search
• network-wide services
• ensure content
availability
(preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for
managing their data
• retain copies of data
ProvONE: extending PROV with process structure
https://purl.dataone.org/provone-v1-dev
Yang Cao, Christopher Jones, Víctor Cuevas-Vicenttín, Matthew B. Jones, Bertram Ludäscher, Timothy McPhillips, Paolo Missier, Christopher Schwalm, Peter Slaughter, Dave Vieglais, Lauren Walker, and Yaxing Wei. “ProvONE: Extending PROV to Support the DataONE Scientific Community.” TAPP Workshop on Theory and Practice of Provenance, 2016.
[Diagram: the ProvONE model, covering workflow structure and retrospective provenance.]
Database provenance
• Why is record R included in the result of a query? [why-provenance] (*)
• Why is record R not in the result? [why-not provenance] (+)

A typical question (from the SIGMOD 2007 tutorial): for a given database query Q, a database D, and a tuple t in the output of Q(D), which parts of D “contribute” to t?

R:
Emp   Dept
John  D01
Susan D02
Anna  D04

S:
Did  Mgr
D01  Mary
D02  Ken
D03  Ed

Q = select r.Emp, r.Dept, s.Mgr
    from R r, S s
    where r.Dept = s.Did

Q(D):
Emp   Dept  Mgr
John  D01   Mary
Susan D02   Ken

The provenance of the tuple (John, D01, Mary) in the output consists of the source facts R(John, D01) and S(D01, Mary), according to the query Q. The same question can also be asked of an attribute value, a table, or any subtree in hierarchical/tree-like data.

(*) Cheney, J., Chiticariu, L., & Tan, W.-C. (2009). Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases, 1, 379–474.
(+) Herschel, M., & Hernández, M. A. (2010). Explaining missing answers to SPJUA queries. Proceedings of the VLDB Endowment, 3(1–2), 185–196. http://doi.org/10.14778/1920841.1920869
Source: Peter Buneman (University of Edinburgh) and Wang-Chiew Tan (UC Santa Cruz), “Provenance in Databases: Past, Current, Future,” SIGMOD Tutorial 2007.
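The why-provenance of the join in this example can be computed by instrumenting the join itself, recording for each output tuple the source facts that produced it; a minimal sketch:

```python
# Sketch: why-provenance for the join example. For each output tuple we
# record the pair of source facts that produced it.
R = [("John", "D01"), ("Susan", "D02"), ("Anna", "D04")]
S = [("D01", "Mary"), ("D02", "Ken"), ("D03", "Ed")]

def join_with_provenance(R, S):
    """Q(D) plus, per output tuple, the contributing source facts."""
    out = {}
    for (emp, dept) in R:
        for (did, mgr) in S:
            if dept == did:
                out[(emp, dept, mgr)] = {("R", emp, dept), ("S", did, mgr)}
    return out

prov = join_with_provenance(R, S)
print(sorted(prov[("John", "D01", "Mary")]))
# [('R', 'John', 'D01'), ('S', 'D01', 'Mary')]
```

Real systems derive the same information from the query plan rather than by rewriting every operator by hand, but the principle is the same.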
Provenance from analytics
Analytics data processing generates potentially valuable knowledge. But credibility of the outcomes requires:
- understandability of the data processing (human-oriented: “prospective provenance”)
- provenance recording and query (machine-oriented: PROV in XML or RDF; Neo4j data and query models)

Problem:
- process complexity and lack of transparency
- provenance is either too coarse-grained or too complex to understand
Some research prototypes
Titian (*)
Apache Spark maintains the program
transformation lineage to recover from
failures
• Titian enhances the Spark RDD
programming model
• data provenance capture
• interactive query support
(*) Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Condie, T. (2015). Titian: Data Provenance
Support in Spark. Proc. VLDB Endow., 9(3), 216–227. http://doi.org/10.14778/2850583.2850595
Lipstick on a pig (+)(++)
A framework that marries database-style and workflow provenance models
The catch… all modules must be
implemented in Pig Latin
(+) Amsterdamer, Y., Davidson, S. B., Deutch, D., Milo, T., Stoyanovich, J., & Tannen, V. (2011). Putting lipstick on
pig: enabling database-style workflow provenance. Proc. VLDB Endow., 5(4), 346–357.
http://dl.acm.org/citation.cfm?id=2095686.2095693
(++) Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (2008). Pig latin: a not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data
(pp. 1099–1110). New York, NY, USA: ACM. http://doi.acm.org/10.1145/1376616.1376726
Provenance for analytics: Titian
Apache Spark natively maintains the program transformation lineage so that it can
reconstruct lost RDD partitions in the case of a failure
• Titian enhances it with data provenance capture and interactive query support
that extends the Spark RDD programming model
• With limited overhead of less than 30%
[1] Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Condie, T. (2015). Titian: Data Provenance
Support in Spark. Proc. VLDB Endow., 9(3), 216–227. http://doi.org/10.14778/2850583.2850595
[Figures: LineageRDD methods for traversing the data lineage in both backward and forward directions [1]; the job workflow after adding the lineage capture points.]
Provenance for analytics: “Lipstick on a pig”
[2] Amsterdamer, Y., Davidson, S. B., Deutch, D., Milo, T., Stoyanovich, J., & Tannen, V. (2011). Putting lipstick on
pig: enabling database-style workflow provenance. Proc. VLDB Endow., 5(4), 346–357.
http://dl.acm.org/citation.cfm?id=2095686.2095693
[3] Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (2008). Pig latin: a not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data
(pp. 1099–1110). New York, NY, USA: ACM. http://doi.acm.org/10.1145/1376616.1376726
“A framework that marries database-style and workflow provenance models
capturing internal state as well as fine-grained dependencies in workflow provenance”
The catch… all modules must be
implemented in Pig Latin:
“an emerging language that
combines high-level declarative
querying with low-level procedural
programming and parallelization in
the style of map-reduce” [3]
Provenance for analytics: Map-Reduce
[4] Ikeda, R., Park, H., & Widom, J. (2011). Provenance for generalized map and reduce workflows. In: CIDR 2011.
Scope: generalized map and reduce workflows (GMRWs), in which input data sets are processed by an acyclic graph of map and reduce functions.
Transformations: Map, Reduce, Union, Split; each transformation has an associated provenance operator.
RAMP (Reduce And Map Provenance) is an extension to Hadoop that transparently wraps each function with provenance capture.
Provenance for analytics: Map-Reduce
[5] Crawl, D., Wang, J., & Altintas, I. (2011). Provenance for MapReduce-based data-intensive workflows. In
Proceedings of the 6th workshop on Workflows in support of large-scale science (pp. 21–30).
The Kepler+Hadoop framework works for Kepler workflows that invoke Hadoop jobs, and captures provenance inside the MapReduce job.
Provenance for analytics: Map-Reduce
[6] Murray, D. G., McSherry, F., Isard, M., Isaacs, R., Barham, P., & Abadi, M. (2016). Incremental, iterative data
processing with timely dataflow. Communications of the ACM, 59(10), 75–83. http://doi.org/10.1145/2983551
Extend the MapReduce framework with change propagation:
• the framework keeps track of the dependencies between subsets of each MapReduce computation
• when a subset of the input changes, it rebuilds only the parts of the computation and of the output affected by the changes
Challenges
• Too little, or too much, provenance
• Not at the right level of abstraction
  • complex, black-box analytics: coarse-grained provenance
  • white-box analytics frameworks: fine-grained provenance, often too detailed (we need abstraction / view mechanisms!)
  • ad hoc analytics (e.g., Python, R): ??
We need flexible, user-level provenance capture from complex analytics.
Additional recent research on Provenance and Big Data
Chen, Peng, and Beth A. Plale. “Big Data Provenance Analysis and Visualization.” In Procs. 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2015), pp. 797–800, May 2015. doi:10.1109/CCGrid.2015.85
Chen, Peng, and Beth A. Plale. “ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model.” In Procs. 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2015), pp. 525–534, May 2015. doi:10.1109/CCGrid.2015.86
Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs
Peter Macko and Margo Seltzer, Harvard University, Procs. TAPP’11, 2011, Crete, Greece
Provenance from Log Files: a BigData Problem, Devarshi Ghoshal and
Beth Plale, Procs. BigProv workshop, EDBT, Genova, Italy, 2013
Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory
Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop,
Edinburgh, 2015
http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf
Goal: minimise re-computation effort across all prior outcomes
Objective:
• Reduce the amount of computation performed in reaction to changes
Constraint: selective re-computation should be lossless
• All instances that may be subject to impact must be considered
• P1: Partial re-execution
• Reduce re-computation to only those parts of a process that are actually
involved in the processing of the changed data
• P2: Differential execution
• Insight: if an instance I of process P is executed on the delta between two versions of its inputs and produces an empty result, then I is not affected by the version change
• Only feasible if certain algebraic properties of the process hold
• P3: Identifying the scope of change
• Determine which instances out of a population of outcomes are going to be
affected by the change
P1: Partial re-execution
Objective: find the minimal sub-graph of a workflow that is affected by a change
Approach:
• e-SC generates one ProvONE provenance trace for each workflow run
• Use traces to identify the minimal sub-workflow that is affected by the change
Initial query, given a new version d’ of a reference database d (variable I denotes one execution instance, i.e. a workflow invocation): find all activities A within invocation I that used a prior version Dep of d’:

:- invocation(I), wasPartOf(A,I),
   wasDerivedFrom(d’,Dep), used(A,Dep)

Traversal queries: collect all other activities connected to A.

Implicit data transfer:
:- execution(A1), execution(A2),
   wasPartOf(A,I), wasInformedBy(A2,A1)

Explicit data transfer (Data from A1 to A2):
:- execution(A1), execution(A2),
   wasGeneratedBy(Data,A1), used(A2,Data)
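The effect of these queries can be sketched as a forward traversal over the trace; the dependency maps below are illustrative stand-ins for a ProvONE trace:

```python
from collections import deque

# Illustrative stand-in for a trace: `used` maps each activity to the
# data it read; `generated` maps each data item to its producer.
# `d` is the reference data that changed.
used = {"A1": {"d", "x"}, "A2": {"y1"}, "A3": {"x"}}
generated = {"y1": "A1", "y2": "A2", "y3": "A3"}

def affected_by(changed_data):
    """Activities that directly or transitively consumed the changed data."""
    frontier = deque(a for a, ds in used.items() if changed_data in ds)
    affected = set()
    while frontier:
        a = frontier.popleft()
        if a in affected:
            continue
        affected.add(a)
        # follow explicit data transfer: a's outputs consumed downstream
        outputs = {d for d, producer in generated.items() if producer == a}
        frontier.extend(a2 for a2, ds in used.items() if ds & outputs)
    return affected

print(sorted(affected_by("d")))  # ['A1', 'A2']: A3 never saw the change
```

The affected set delimits the minimal sub-workflow to re-execute; everything else can reuse its previously stored outputs.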
Results
• How much can we save? It depends on:
  • the process structure
  • the point of first usage of the reference data
• Overhead: storing the interim data required for partial re-execution
  • 156 MB for GeneMap changes and 37 kB for ClinVar changes

Time savings:
            Partial re-execution (sec)   Complete re-execution (sec)   Time saving (%)
GeneMap     325                          455                           28.5
ClinVar     287                          455                           37
P2: Differential execution
Suppose D is a relation (a table), and diffD() computes the difference between two versions of D. The idea is to compute P’s output on the new version as a combination of the previously computed output and P applied to the difference. This is effective if the difference is small relative to D, and it can be achieved when the operators that make up P satisfy certain algebraic properties(*).

(*) associative, and distributive over set union and difference

[The definitions of diffD() and the recombination are given as equations in the original slide figure.]
P2: Partial re-computation using input difference
Insight: run SVI, but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) becomes Q(diff(CV1, CV2)).
This works for SVI but is hard to generalise: it depends on the type of process.
Gain: diff(CV1, CV2) is much smaller than CV2.

GeneMap versions (from –> to)   ToVersion rec. count   Difference rec. count   Reduction
16-03-08 –> 16-06-07            15910                  1458                    91%
16-03-08 –> 16-04-28            15871                  1386                    91%
16-04-28 –> 16-06-01            15897                  78                      99.5%
16-06-01 –> 16-06-02            15897                  2                       99.99%
16-06-02 –> 16-06-07            15910                  33                      99.8%

ClinVar versions (from –> to)   ToVersion rec. count   Difference rec. count   Reduction
15-02 –> 16-05                  290815                 38216                   87%
15-02 –> 16-02                  285042                 35550                   88%
16-02 –> 16-05                  290815                 3322                    98.9%
P3: Identifying the scope of change: a game of battleship
[Figure: a patient / change impact matrix, with X marking the combinations actually affected by a change.]

Challenge: precisely identify the scope of a change.
Blind reaction to change: recompute the entire matrix.
Can we do better? Hit the high-impact cases (the X marks) without re-computing the entire matrix.
A scoping algorithm
Coarse-grained provenance indicates whether or not a dependency on D existed, but not which specific data from version Dt of D was used.

Candidate invocation: any invocation I of P whose provenance contains statements of the form
  used(A,Dt), wasPartOf(A,I), wasAssociatedWith(I,_,P)
or
  used(I,Dt), wasAssociatedWith(I,_,P)

Sketch of the algorithm (simplified):
- For each candidate invocation I of P:
  - compute the minimal subgraph P’ of P that needs re-computation
  - partially re-execute P’ using the difference sets as inputs, one step at a time, until <empty output> or <P’ completed>
  - if <P’ completed>, execute P’ on the full inputs
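A minimal executable sketch of that loop, with placeholder step functions standing in for the reduced sub-workflow P’:

```python
def in_scope(steps, diff_input):
    """Run P' one step at a time on the difference set; an empty
    intermediate output proves the invocation is unaffected."""
    data = diff_input
    for step in steps:
        data = step(data)
        if not data:
            return False          # <empty output>: out of scope
    return True                   # <P' completed>: re-run on full inputs

def recompute_scope(candidates, steps, diff_input):
    """Candidate invocations that must be re-executed on the full inputs."""
    return [inv for inv in candidates if in_scope(steps, diff_input)]

# Placeholder steps: one passes data through, the other filters it all out.
def keep_all(xs):
    return xs

def drop_all(xs):
    return []

print(recompute_scope(["inv1", "inv2"], [keep_all], [1, 2]))            # ['inv1', 'inv2']
print(recompute_scope(["inv1", "inv2"], [keep_all, drop_all], [1, 2]))  # []
```

In the real algorithm each invocation has its own provenance-derived P’ and difference set; the sketch shares them across candidates only for brevity.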
Scoping: efficiency
Total re-computation time for the whole patient cohort:
[Chart: execution time (hours, 0–5) against ClinVar update date (mm/yy), comparing four strategies: CV blind; CV selective process; CV selective scope, δ-gen; CV selective scope, δ-SVI.]
We expect to pay a penalty for running the algorithm when the difference sets are
large compared to actual new data
More accurate diff functions result in higher runtime savings
• Smaller difference sets
• More precise scoping
Provenance in ReComp: Summary and Challenges
Objective:
• Reduce the amount of computation performed in reaction to changes
1. Partial re-execution of previously computed workflows
2. (Differential execution)
3. Identifying the scope of change
• Makes use of (2) to determine which instances of a population of outcomes are
going to be affected by a change
Challenges / work in progress:
• Validate and extend the approach to other case studies
• Design estimators to predict the impact of change
• Design and implement a generic ReComp meta-process
• Observe P in execution
• Detect changes
• Selectively react to changes
Talk Outline
• Provenance, why? (in science)
• Provenance, of what?
• Of Scientific Data: The DataONE Federation of Data Repositories (dataone.org)
• Of database data (very briefly)
• Of Web data: the W3C PROV data model for provenance
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• Provenance for streaming data analytics
Process-specific provenance using templates
Aim: enable provenance generation from black-box analytics.

[Diagram: a process definition P: X –> Y is paired with a provenance template T and a sidecar binding process PB: <xi, yi, P> –> B. For each execution xi –> Pi –> yi, PB produces a binding B, and Apply(B,T) yields a PROV document.]

Approach:
1) human-oriented: “prospective provenance” (YesWorkflow)
2) machine-oriented: PROV (XML, RDF, Neo4j data and query models)

The resulting provenance is:
• application-level
• user-specified
• coarse-grained
Illustration: the case of map()
[y1 … yn] = map(λx. f(x), [x1 … xn])

Process definition: map: <X, lambda x: f(x)> –> Y

Provenance template T: used(:a, :x), wasGeneratedBy(:y, :a), with activity :a associated with plan f.

Execution: [y1 … yn] = map(lambda x: f(x), [x1 … xn])

Binding produced by the sidecar process PB:
B = { <:x ← x1, :y ← y1, :a ← map1>,
      …
      <:x ← xn, :y ← yn, :a ← mapn> }

Apply(B,T) yields, for each i: used(mapi, xi) and wasGeneratedBy(yi, mapi), with each mapi associated with plan f.
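The sidecar idea for map() can be emulated in a few lines; the tuple encoding of PROV statements below is illustrative, not an official serialization:

```python
# TEMPLATE mirrors the slide's template T: used(:a, :x), gen(:y, :a).
TEMPLATE = [("used", ":a", ":x"), ("wasGeneratedBy", ":y", ":a")]

def map_with_provenance(f, xs):
    """map() plus the sidecar bindings: one binding per element."""
    ys, bindings = [], []
    for i, x in enumerate(xs, start=1):
        ys.append(f(x))
        bindings.append({":x": f"x{i}", ":y": f"y{i}", ":a": f"map{i}"})
    return ys, bindings

def apply_template(template, binding):
    """Apply(B, T): substitute the binding's values into the template."""
    return [(p, binding[s], binding[o]) for (p, s, o) in template]

ys, bindings = map_with_provenance(lambda x: x * x, [2, 3])
print(ys)                                   # [4, 9]
print(apply_template(TEMPLATE, bindings[0]))
# [('used', 'map1', 'x1'), ('wasGeneratedBy', 'y1', 'map1')]
```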
Application of template approach to streaming analytics
Data in movement is a prime source for value-added analytics applications, e.g. data streams from Internet of Things devices.
The provenance of an output data stream is a stream of provenance statements, and the template / sidecar process / binding framework applies without changes.
Following the Spark Streaming model:
• a stream is discretised into a sequence of micro-batch intervals
• a window W = [x1 … xk] is a user-defined sequence of intervals xi
• a process P: Wi –> Y operates on one window at a time (Y may be a sequence or another multivariate data structure)
Provenance streams
A simple template, “each :y is generated using the content of one window”: used(:a, :w), wasGeneratedBy(:y, :a).

Sidecar invocations B1P, B2P, … run alongside each invocation P1, P2, … of P:
  wi = [xi1 … xin]
  y1 = P1(w1),  B1 = B1P(w1, P1, y1),  Prov1 = apply(B1, T)
  y2 = P2(w2),  B2 = B2P(w2, P2, y2),  Prov2 = apply(B2, T)
  …

Each BiP produces a binding Bi, and applying Bi to the template produces a PROV document Provi. The result is a stream of provenance alongside the stream of outputs yi.
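This scheme can be sketched as a generator that yields a PROV-style document alongside each window output, assuming the discretised-stream model above (the tuple encoding of statements is illustrative):

```python
# Template for the streaming case: used(:a, :w), wasGeneratedBy(:y, :a).
TEMPLATE = [("used", ":a", ":w"), ("wasGeneratedBy", ":y", ":a")]

def apply_template(template, binding):
    return [(p, binding[s], binding[o]) for (p, s, o) in template]

def windows(stream, size):
    """Discretise a (finite, for this sketch) stream into windows."""
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

def process_with_provenance(P, stream, size):
    """Yield each window output together with its provenance document."""
    for i, w in enumerate(windows(stream, size), start=1):
        y = P(w)
        binding = {":w": f"w{i}", ":a": f"P{i}", ":y": f"y{i}"}
        yield y, apply_template(TEMPLATE, binding)

out = list(process_with_provenance(sum, [1, 2, 3, 4], size=2))
print(out[0])  # (3, [('used', 'P1', 'w1'), ('wasGeneratedBy', 'y1', 'P1')])
```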
Extensions
The framework applies to a stateful process P:
• P’s outcome depends on an internal state S
• P’s execution may modify S
Hint: S is itself an entity with provenance, defined by its own template and bindings.

Add flexibility by allowing multiple templates and multiple sidecar processes for the same process execution.

The actual provenance output at the end of the process, Prov = apply(B,T), is an extensional representation; <T, B> is an intensional representation, which is
• more space-efficient (T is stored only once; B is a set of variable-value pairs)
• only serialised when needed
Summary
• Provenance, why? (in science)
• Provenance, of Scientific Data: The DataONE Federation of Data
Repositories (dataone.org)
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• (Provenance for streaming data analytics)
Selected bibliography
Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell,
et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012.
http://www.w3.org/TR/prov-dm/
Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012.
http://www.w3.org/TR/prov-constraints/
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web
Semantics: Science, Services and Agents on the World Wide Web (April 2015).
doi:10.1016/j.websem.2015.04.001.
http://www.sciencedirect.com/science/article/pii/S1570826815000177
Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz,
Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific
Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530.
http://dx.doi.org/10.1002/cpe.1870.
Firth, H., and P. Missier. "ProvGen: Generating Synthetic PROV Graphs with Predictable Structure." In Procs. IPAW 2014 (Provenance and Annotations), Köln, Germany: Springer, 2014.
http://arxiv.org/pdf/1406.2495
Missier, P., J. Bryans, C. Gamble, V. Curcin, and R. Danger. "ProvAbs: Model, Policy, and Tooling for Abstracting PROV Graphs." In Procs. IPAW 2014 (Provenance and Annotations), Köln, Germany: Springer, 2014.
http://arxiv.org/pdf/1406.1998
De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance
Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh,
Scotland: USENIX Association, 2015. https://www.usenix.org/conference/tapp15/workshop-
program/presentation/de-oliveira.
Editor's Notes
W3C Recommendation (REC)
A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations.
remark on PROV-AQ: nothing to do with querying, but a query model can be associated to each of the encodings
Working Group Note
A Working Group Note is published by a chartered Working Group to indicate that work has ended on a particular topic. A Working Group may publish a Working Group Note with or without its prior publication as a Working Draft.
baseline-noAgents.provn
baseline-noAgents-unqual.n3
agents are software, organization, person -- non-normative
distinguish between normative and non-normative parts of the PROV documents
Examples of association between an activity and an agent are:
creation of a web page under the guidance of a designer;
various forms of participation in a panel discussion, including audience member, panelist, or panel chair;
a public event, sponsored by a company, and hosted by a museum;
A browser button by which the user can express their uncertainty about a document being displayed “so how do I know I can trust this information?”.
Upon activation of the button, the software then retrieves metadata about the document, listing assumptions on which trust can be based.
hdb_store_2:used(_, CV1Import, CV1, _, _), hdb_store_2:wasPartOf(CV1Import, WI1),
%hdb_store_2:execution(WI1, WIst, WIet, WIAttrs), parse_time(WIst, WIStartTS), date_time_stamp(date(2016, 12, 8), Today), WIStartTS @>= Today,
%hdb_store_2:execution(WI1, WIst, WIet, WIAttrs), split_string(WI1, "/", "", WITokens), last(WITokens, WInvId), number_string(WInvIdNo, WInvId), WInvIdNo > 55900,
%sub_string(CV1Attrs.get('prov:label'), 0, _, _, "variant_summary-"), sub_string(CV1Attrs.get('prov:label'), _, _, 0, "Del.csv"), hdb_store_2:used(_, CV1Import, CV1, _, _), hdb_store_2:wasPartOf(CV1Import, WI1),
hdb_store_2:wasPartOf(GM1Import, WI1), hdb_store_2:used(_, GM1Import, GM1, _, _), hdb_store_2:document(GM1, GM1Attrs),
split_string(GM1, "/", "", GM1Tokens), append(_, [GM1DocId, _], GM1Tokens), % append is to take the second to last element of the list
member(GM1DocId,
hdb_store_2:wasPartOf(Out1Export, WI1), hdb_store_2:wasGeneratedBy(_, Out1, Out1Export, _, _), hdb_store_2:document(Out1, Out1Attrs), Out1Attrs.get('prov:label') == "svi-classification.csv",
hdb_store_2:wasPartOf(PV1Import, WI1), PV1Import \= CV1Import, PV1Import \= GM1Import, hdb_store_2:used(_, PV1Import, PV1, _, _), hdb_store_2:document(PV1, PV1Attrs),
To what extent can these be formalised and automated?
Need to update with new / upcoming MN locations and logos
Amber notes: Retain CN, MN logo? Required if used elsewhere; if not, cut? Not all MN logos will fit – select representative or cut? Cross-reference with Google MN
Rebecca:
Need updated logos for KNB, AOOS (FIXED) – I would select a different set of MNs to highlight since all won’t fit
Rebecca:
Can we do a better job than the quad chart? If not, are all the logos in the 1st quadrant appropriate?
Update before RSV
Figure shows from 2020 – edit?
Data in movement (the Velocity dimension of Big Data) is gaining prominence as a prime source of data for analytics applications. Data streams generated by Internet of Things devices, for instance, are a rich source of implicit signals about the habits of the individuals who operate those devices, e.g. in their smart homes, smart cars, through wearables, etc.
The native Spark compute method is used to plug a LineageRDD instance into the Spark dataflow (described in Section 4).
Firstly, if we can analyse the structure and semantics of process P, then to recompute an instance of P more effectively we may be able to reduce re-computation to only those parts of the process that are actually involved in processing the changed data. For this, we are inspired by techniques for smart rerun of workflow-based applications [6, 7], as well as by more general approaches to incremental computation [8, 9].
(note that we need to restrict the query to the specific invocation I, in case other workflows have used the output of I)
Also, as in Tab. 2 and 3 in the paper, I'd mention whether this reduction was possible with a generic diff function or with a specific function tailored to SVI.
What is also interesting, and what I would highlight, is that even if the reduction is very close to 100% but below it, the cost of recomputing the process may still be significant because of constant-time overheads related to running a process (e.g. loading data into memory). e-SC workflows suffer from exactly this issue (every block serializes and deserializes data), and that is why Fig. 6 shows an increase in runtime for GeneMap executed with 2 deltas even though the reduction is 99.94% (cf. Tab. 2 and Fig. 6 for the GeneMap diff between 16-10-30 → 16-10-31).
Regarding the algorithm, you show the simplified version (Alg. 1). But please also take a look at Alg. 2 and mention that you can only run the loop if distributivity holds for all P in the downstream graph. Otherwise, you need to break and re-execute on the full inputs as soon as the first non-distributive task produces a non-empty output. But, obviously, the hope is that with a well-tailored diff function the output will be empty in the majority of cases.
This figure emphasizes the penalty for running the algorithm when the difference sets were large compared to the actual new data. But it also highlights the importance of the diff and impact functions. Clearly, the more accurate the functions are, the higher the runtime savings may be, which stems from two facts. Firstly, a more accurate diff function tends to produce smaller difference sets, which reduces the time of task re-execution (cf. the CV-diff and CV-SVI-diff lines in Fig. 7). Secondly, a more accurate impact function tends to return false more frequently, and so the algorithm can more often avoid re-computation with the complete new version of the data (cf. the number of black squares vs. the total number of patients affected by a change in Tab. 4).
Differential execution: if I can execute P again using the changes between the old and new input, and the result is empty, then I may conclude that P did not use any of the elements in the difference set.
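The idea can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: `diff`, `propagate`, and `keep_even` are hypothetical names, the diff is naive set difference, and real tasks would need task-specific diff and impact functions as discussed above.

```python
# Differential execution sketch: re-run each downstream task on the
# difference set only. An empty result means the change does not
# propagate further. The loop is only sound while tasks distribute over
# the diff; at the first non-distributive task we must fall back to
# re-execution on the full inputs.

def diff(old, new):
    # Naive difference set: elements added or removed between versions.
    return (set(new) - set(old)) | (set(old) - set(new))

def propagate(tasks, old_input, new_input):
    delta = diff(old_input, new_input)
    for task, distributive in tasks:
        if not delta:
            break                  # empty diff: downstream tasks unaffected
        if not distributive:
            return f"full re-execution from {task.__name__}"
        delta = task(delta)        # re-run on the difference set only
    return "affected" if delta else "outputs unchanged"

# Toy task that distributes over the diff: a per-element filter.
def keep_even(xs):
    return {x for x in xs if x % 2 == 0}

print(propagate([(keep_even, True)], old_input=[1, 2, 3], new_input=[1, 2, 5]))
```

In this run the changed elements (3 removed, 5 added) are both filtered out by `keep_even`, so the empty difference set lets us conclude that the output is unchanged without full re-execution.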
Map is well defined, white-box, and well understood, so it is a good starting point for appreciating the template-and-binding idea.
The first element is a provenance template.
<show template T1> --> show a graph using provtoolbox
Then we have a binding, along with a binding generation process.
<example B1>
Finally, apply(B, T).
Show the final prov graph using provtoolbox.
Spark Streaming uses a micro-batch architecture, whereby a streaming computation is achieved through a sequence of batch computations, each operating on a fragment of the stream defined by a configurable batch interval. Each micro-batch consists of a finite sequence of data structures (RDDs, Spark's main data abstraction), known as a DStream.
we may view a DStream within each batch as a list of values, like those we have used in our batch examples. Thus, just as it allows Spark Streaming to reuse most of its RDD transformation and action operators, micro-batching also makes our framework reusable for generating provenance over streams. Spark also supports windows as application-friendly abstractions over streams. A window is simply a sequence of contiguous micro-batches, and a breakdown of the stream into windows is specified in the usual way, namely by a combination of window size (the number of micro-batches) and sliding duration. In this setting, the template T is defined so that it applies to each window, and the sidecar process executes once for each window, producing a stream of bindings B1, B2, …. Correspondingly, this triggers a sequence of calls apply(B1, T), apply(B2, T), …, resulting in a stream of provenance statements alongside the corresponding data input and output streams.
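The micro-batch and window model can be sketched in plain Python, without the Spark Streaming API. All names here (`micro_batches`, `windows`) are hypothetical stand-ins: the first mimics the cut of a stream into DStream micro-batches, the second builds windows as contiguous runs of micro-batches from a window size and a slide.

```python
# Plain-Python sketch of the micro-batch / window model (not the actual
# Spark Streaming API).

def micro_batches(stream, batch_interval):
    # Split the stream into fixed-size micro-batches (stand-in for a DStream).
    return [stream[i:i + batch_interval]
            for i in range(0, len(stream), batch_interval)]

def windows(batches, window_size, slide):
    # A window = `window_size` contiguous micro-batches, advancing by `slide`.
    return [batches[i:i + window_size]
            for i in range(0, len(batches) - window_size + 1, slide)]

stream = list(range(12))
batches = micro_batches(stream, batch_interval=2)   # [[0, 1], [2, 3], ...]
wins = windows(batches, window_size=3, slide=1)

# One binding per window; apply(B_i, T) would then instantiate the template.
prov_stream = [{"var:w": [x for b in w for x in b]} for w in wins]
```

With these parameters, 12 stream elements yield 6 micro-batches and 4 overlapping windows, hence a stream of 4 bindings, exactly the B1, B2, … sequence described above.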