Data Provenance and its role in Data Science
Dr. Paolo Missier
School of Computing Science
Newcastle University, UK
Data Science Workshop
Islamabad, April 2017
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.;
• a record of the passage of an item through its various owners: chain of custody
Magna Carta (‘the Great Charter’) was
agreed between King John and his barons
on 15 June 1215.
A PROV provenance graph
[Figure: a PROV graph for a document editing and publishing scenario. Editing phase (remote past → recent past): the activities drafting, commenting and editing use and generate the entities draft v1, draft comments and draft v2 (attributes: distribution=internal, status=draft, version=0.1), and a reading activity used paper3; the agents Bob (specializations Bob-1 and Bob-2; roles main_editor, jr_editor, author, editor) and Alice are connected to these activities and entities via wasAssociatedWith, actedOnBehalfOf, wasAttributedTo and specializationOf relations. Publishing phase: the guideline update and publication activities use draft v2 and generate working draft WD1 and pub guidelines v1 and v2 (distribution=public, status=draft, version=1.0), with the agents Charlie (role headOfPublication), Alice and the w3c:consortium (type institution, role issuer).]
The W3C Working Group on Provenance
2009–2010: W3C Incubator Group on provenance (chair: Yolanda Gil, ISI, USC).
Main output: the “Provenance XG Final Report”, http://www.w3.org/2005/Incubator/prov/XGR-prov/, which
- provides an overview of the various existing approaches and vocabularies
- proposes the creation of a dedicated W3C Working Group
April 2011: W3C Working Group approved (chairs: Luc Moreau, Paul Groth).
April 2013: Proposed Recommendations finalised:
- prov-dm: Data Model
- prov-o: OWL ontology, RDF encoding
- prov-n: PROV notation
- prov-constraints
…plus a number of non-prescriptive Notes.
http://www.w3.org/2011/prov/wiki/
PROV: scope and structure
Source: http://www.w3.org/TR/prov-overview/ (the PROV family of documents on the W3C Recommendation track)
See also:
Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures on the
Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129.
doi:10.2200/S00528ED1V01Y201308WBE007.
PROV notation (PROV-N)
document
prefix prov <http://www.w3.org/ns/prov#>
prefix ex <http://www.example.com/>
entity(ex:draftComments)
entity(ex:draftV1, [ex:distr = "internal", ex:status = "draft"])
entity(ex:paper1)
entity(ex:paper2)
activity(ex:commenting)
activity(ex:drafting)
wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00)
used(ex:commenting, ex:draftV1, -)
wasGeneratedBy(ex:draftV1, ex:drafting, -)
used(ex:drafting, ex:paper1, -)
used(ex:drafting, ex:paper2, -)
endDocument
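A minimal sketch, assuming the `prov` Python package (pip install prov) rather than anything shown on the slides, of how the same statements can be built and serialised programmatically:

# Building the PROV-N example above with the 'prov' Python package (assumed available)
from datetime import datetime
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://www.example.com/')

# entities and activities from the example
doc.entity('ex:draftComments')
doc.entity('ex:draftV1', {'ex:distr': 'internal', 'ex:status': 'draft'})
doc.entity('ex:paper1')
doc.entity('ex:paper2')
doc.activity('ex:commenting')
doc.activity('ex:drafting')

# usage and generation relations
doc.wasGeneratedBy('ex:draftComments', 'ex:commenting', time=datetime(2013, 3, 18, 11, 10))
doc.used('ex:commenting', 'ex:draftV1')
doc.wasGeneratedBy('ex:draftV1', 'ex:drafting')
doc.used('ex:drafting', 'ex:paper1')
doc.used('ex:drafting', 'ex:paper2')

print(doc.get_provn())      # PROV-N serialisation; other formats (JSON, RDF, XML) via doc.serialize()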
Same example — PROV-O notation
:draftComments a prov:Entity ;
:distr "internal"^^xsd:string ;
prov:wasGeneratedBy :commenting .
:commenting a prov:Activity ;
prov:used :draftV1 .
:draftV1 a prov:Entity ;
:distr "internal"^^xsd:string ;
:status "draft"^^xsd:string ;
:version "0.1"^^xsd:string ;
prov:wasGeneratedBy :drafting .
:drafting a prov:Activity ;
prov:used :paper1,
:paper2 .
:paper1 a prov:Entity , :Reference .
:paper2 a prov:Entity , :Reference .
(RDF / Turtle notation)
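A minimal sketch, assuming rdflib (not part of the slides), of loading this Turtle and tracing the lineage of :draftComments with a SPARQL property path:

# Loading the Turtle above with rdflib and following generation/usage chains backwards
from rdflib import Graph

ttl = """
@prefix :     <http://www.example.com/> .
@prefix prov: <http://www.w3.org/ns/prov#> .

:draftComments a prov:Entity ; :distr "internal" ; prov:wasGeneratedBy :commenting .
:commenting    a prov:Activity ; prov:used :draftV1 .
:draftV1       a prov:Entity ; prov:wasGeneratedBy :drafting .
:drafting      a prov:Activity ; prov:used :paper1 , :paper2 .
"""

g = Graph().parse(data=ttl, format="turtle")

q = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?source WHERE {
  <http://www.example.com/draftComments> (prov:wasGeneratedBy/prov:used)+ ?source .
}
"""
for row in g.query(q):
    print(row.source)     # expected: :draftV1, :paper1, :paper2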
Association, Attribution, Delegation: who did what?
An activity association is an assignment of responsibility to an agent for an activity,
indicating that the agent had a role in the activity.
Attribution is the ascribing of an entity to an agent.
entity(ex:draftComments, [ex:distr = "internal"])
activity(ex:commenting)
agent(ex:Bob, [prov:type = "mainEditor"] )
agent(ex:Alice, [prov:type = "srEditor"])
wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role = "editor"])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)
Same example — PROV-O notation (RDF/N3)
:Alice a prov:Agent , ex:chiefEditor ;
  :firstName "Alice" ;
  :lastName "Cooper" .
:Bob a prov:Agent , ex:seniorEditor ;
  :firstName "Robert" ;
  :lastName "Thompson" ;
  prov:actedOnBehalfOf :Alice .
:draftComments prov:wasAttributedTo :Bob .
:drafting a prov:Activity ;
prov:wasAssociatedWith :Bob .
Association and Attribution
Q.: what is the relationship between attribution and association?
This is captured by an inference rule in the PROV-CONSTRAINTS document:
IF    entity(e), agent(Ag), wasAttributedTo(e, Ag)
THEN  there exists an activity a such that
      wasGeneratedBy(e, a, -) and wasAssociatedWith(a, Ag, -)
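A minimal Python sketch of this inference applied to a toy set of statements (the tuple encoding is purely illustrative, not a PROV serialisation):

# Applying the attribution inference over a toy statement store
from itertools import count

fresh = (f"_:a{i}" for i in count())   # fresh ids for existential activities

statements = {
    ("entity", "ex:draftComments"),
    ("agent", "ex:Bob"),
    ("wasAttributedTo", "ex:draftComments", "ex:Bob"),
}

def attribution_inference(stmts):
    """wasAttributedTo(e, ag) => exists a: wasGeneratedBy(e, a), wasAssociatedWith(a, ag)."""
    new = set()
    for s in stmts:
        if s[0] == "wasAttributedTo":
            _, e, ag = s
            known = any(g[0] == "wasGeneratedBy" and g[1] == e and
                        ("wasAssociatedWith", g[2], ag) in stmts for g in stmts)
            if not known:
                a = next(fresh)
                new |= {("activity", a), ("wasGeneratedBy", e, a), ("wasAssociatedWith", a, ag)}
    return stmts | new

print(attribution_inference(statements))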
Three Views of Provenance
Derivation amongst entities
A derivation is a transformation of an entity into another, an update of an entity
resulting in a new one, or the construction of a new entity based on a pre-existing
entity.
entity(ex:draftV1)
entity(ex:draftComments)
wasDerivedFrom(ex:draftComments, ex:draftV1)
Q.: what is the relationship between derivation, generation, and usage?
:draftComments a prov:Entity ;
prov:wasDerivedFrom :draftV1 .
:draftV1 a prov:Entity .
From “scruffy” provenance to “valid” provenance
- Are all possible temporal partial orderings of events equally acceptable?
- How can we specify the set of all valid orderings?
- How do we formally define what it means for a set of provenance statements to be valid?
PROV defines a set of temporal constraints that ensure the consistency of a provenance graph.
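A minimal Python sketch of checking one such constraint — the generation of an entity must precede any usage of it — over toy timestamped events:

# Checking the "generation precedes usage" ordering over toy events
from datetime import datetime

generated = {  # entity -> generation time
    "ex:draftV1": datetime(2013, 3, 18, 10, 0),
}
used = [  # (activity, entity, usage time)
    ("ex:commenting", "ex:draftV1", datetime(2013, 3, 18, 11, 10)),
]

def check_generation_precedes_usage(generated, used):
    violations = []
    for activity, entity, t_use in used:
        t_gen = generated.get(entity)
        if t_gen is not None and t_gen > t_use:
            violations.append((entity, activity, t_gen, t_use))
    return violations

print(check_generation_precedes_usage(generated, used))   # [] means consistent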
Talk Outline
• Provenance, why? (in science)
• Provenance of Scientific Data
• The DataONE Federation of Data Repositories (dataone.org)
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• (Provenance for streaming data analytics)
Why provenance?
• Provenance in machine learning:
  • Why is my predictive algorithm recommending these new friends to me?
  • How can I trust my classifier’s predictions?
  • Example: assessing trust in Web artifacts and crowdsourced annotations [1]
• Reproducibility of your own and your peers’ work, i.e. in experimental science
• Communication: to engender trust in the data and amongst the people and systems that are responsible for it
• Understandability: to explain the outcome of a complex decision process

[1] Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., van Hage, W. R., & Nottamkandath, A. (2016). Combining User Reputation and Provenance Analysis for Trust Assessment. J. Data and Information Quality, 7(1–2), 6:1–6:28. http://doi.org/10.1145/2818382
Trusted Web data: Provenance on the Web
Tim Berners-Lee’s “Oh Yeah” button:
http://users.ugent.be/~tdenies/OhYeah/
Easy Access to Provenance: an Essential Step Towards Trust on the Web. In Procs. METHOD 2013: The 2nd IEEE International Workshop on Methods for Establishing Trust with Open Data, held in conjunction with COMPSAC, the IEEE Signature Conference on Computers, Software & Applications, July 22–26, 2013, Kyoto, Japan.
http://dx.doi.org/10.1109/COMPSACW.2013.29
Understandability: explaining process outcomes
• Which process was used to derive a
diagnosis?
• How did the process use the input
data?
• How were the steps configured?
• Which decisions were made by
human experts (clinicians)?
[Figure: clinical diagnosis of genetic diseases. An NGS pipeline produces annotated patient variants. Variant scoping maps user-supplied disease keywords, gene lists and preferred genes to genes in scope (HPO match, HPO to OMIM, OMIM match, OMIM to gene, gene union/intersect) and selects the variants in scope. Variant filtering applies a MAF threshold, keeps non-synonymous, stop/gain and frameshift variants, and uses known polymorphisms, homo/heterozygous status and pathogenicity predictors to produce candidate variants. Variant classification looks the candidates up in ClinVar and OMIM and labels them RED (found, pathogenic), GREEN (found, benign) or AMBER (not found / uncertain).]
Example provenance query:
“Find all invocations that used a
specific version of ClinVar and OMIM,
and group them by phenotype”
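A minimal sketch of how such a query could be expressed in SPARQL over PROV-O, with hypothetical identifiers for the database versions (ex:clinvar_v2016_05, ex:omim_v2016_06) and a hypothetical ex:phenotype attribute on each invocation:

# Example provenance query in SPARQL (identifiers and phenotype attribute are illustrative)
QUERY = """
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ex:   <http://www.example.com/>

SELECT ?phenotype (COUNT(DISTINCT ?invocation) AS ?n)
WHERE {
  ?invocation a prov:Activity ;
              prov:used ex:clinvar_v2016_05 ;
              prov:used ex:omim_v2016_06 ;
              ex:phenotype ?phenotype .
}
GROUP BY ?phenotype
"""
# could be run with e.g. rdflib:  Graph().parse("traces.ttl").query(QUERY)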
Reproducibility and dissemination in Science
Experimental science is data-intensive
Independent validation of result claims is a cornerstone of scientific discourse
Provenance is the equivalent of a formal logbook
• Capture all steps involved in the derivation of a
result
• How much detail?
• Replay, validate, compare
Lifecycle of experimental datasets
[Figure: a process P is computed in environment Env over data D with dependencies dep, producing D'; the package <D, P, dep, spec(P), prov(D)> is published, then searched and discovered; as the experiment evolves (D → D1, P → P', dep → dep'), P' is deployed in Env(dep') and the provenance prov(D') supports Compare(P, P', D, D').]
Reproducibility: working, reporting
[Figure: the research environment, where the work is done, versus the publication environment — the author submits the article (and moves on…), which is published after peer review.]
Re-what? The Re-* terminology:
• Repeat: same experiment, same setup, same lab: P, D, dep, env(dep)
• Replicate: same experiment, same setup, different lab: P, D, dep, env'(dep)
• ReRun: vary experiment and setup, same lab: P → P', D → D', dep → dep'
• Reproduce: vary experiment and setup, different lab: P → P', D → D', dep → dep', env(dep) → env'(dep')
• Reuse: different experiment: D, P → Q
Lifecycle with tools annotations
[Figure: the same lifecycle annotated with supporting tools — Research Objects for packaging; the DataONE federated research data repositories for publishing, search and discovery; TOSCA-based virtualisation and ReproZip for deployment (spec(P'), Env(dep')); YesWorkflow, NoWorkflow and a Matlab provenance recorder (DataONE) for capturing workflow provenance, spec(P) and prov(D'); PDIFF for differencing provenance in Compare(P, P', D, D').]
References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et
al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011).
doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati,
Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.”
In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014.
doi:10.2218/ijdc.v9i2.332.
Process Virtualisation using TOSCA
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using
TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015.
doi:10.1109/CLOUD.2015.146.
NoWorkflow: provenance recording for Python
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire.
“noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne,
Germany: Springer, 2014.
YesWorkflow: Qian Zhang, Yang Cao, Qiwen Wang, Duc Vu, Priyaa Thavasimani, Timothy
McPhillips, Paolo Missier, Bertram Ludäscher, Revealing the Detailed Lineage of Script Outputs using
Hybrid Provenance, Procs. International Data Curation Conference, DCC, Edinburgh, 2017.
PDIFF: comparing provenance traces
A graph obtained as the result of a traces “diff”, which can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions.
[Figure: (i) trace A and (ii) trace B of the same workflow (processors P0, P1 over services S0–S4, data items d1, d2, d3, x, y, z, w) producing outputs dF and dF'; (iii) the delta tree records where the traces diverge — (dF, dF'), (y, y'), (w, w'), (d2, d2') — the simplest possible delta “graph”.]
Two executions of the same workflow, with slight differences:
- Unintentional changes (e.g. incorrect porting / re-deployment) → cause analysis
- Intentional changes (e.g. different parameter settings) → impact analysis
Missier, P., Woodman, S., Hiden, H., & Watson, P. (2013). Provenance and data differencing for
workflow reproducibility analysis. Concurrency and Computation: Practice and Experience, 28(4),
995–1015. http://doi.org/10.1002/cpe.3035
DataONE cyberinfrastructure: components for a flexible, scalable, sustainable network
www.dataone.org/member-nodes
Coordinating Nodes:
• retain the complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes:
• diverse institutions
• serve their local community
• provide resources for managing their data
• retain copies of data
Cyberinfrastructure
[Figure: member-node cyberinfrastructure — science data and science metadata are held with system metadata, provenance and ontology-based annotation; a metadata index and search API support discovery; replication and data services (extraction, sub-setting, etc.) operate over the holdings.]
Data Holdings
ProvONE: extending PROV with process structure
https://purl.dataone.org/provone-v1-dev
Yang Cao, Christopher Jones, Víctor Cuevas-Vicenttín, Matthew B. Jones, Bertram Ludäscher, Timothy
McPhillips, Paolo Missier, Christopher Schwalm, Peter Slaughter, Dave Vieglais, Lauren Walker, Yaxing Wei,
ProvONE: extending PROV to support the DataONE scientific community, TAPP workshop on Theory and
Practice of Provenance, 2016.
Workflow structure
Retrospective provenance
Simple user-level provenance visualisation
Database provenance
• Why is record R included in the result of a query? [why-provenance] (*)
• Why is record R not in the result? [why-not provenance] (+)

Example of data provenance — a typical question: for a given database query Q, a database D, and a tuple t in the output of Q(D), which parts of D “contribute” to t?

  R: Emp   Dept       S: Did  Mgr        Q = select r.A, r.B, s.C
     John  D01           D01  Mary           from R r, S s
     Susan D02           D02  Ken            where r.B = s.B
     Anna  D04           D03  Ed

  Q(D): Emp   Dept  Mgr
        John  D01   Mary
        Susan D02   Ken

The provenance of the tuple (John, D01, Mary) in the output consists of the source facts R(John, D01) and S(D01, Mary), according to the query Q. The question could also be applied to an attribute value, a table, or any subtree in hierarchical/tree-like data.

(*) Cheney, J., Chiticariu, L., & Tan, W.-C. (2009). Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases, 1, 379–474.
(+) Herschel, M., & Hernández, M. A. (2010). Explaining missing answers to SPJUA queries. Proceedings of the VLDB Endowment, 3(1–2), 185–196. http://doi.org/10.14778/1920841.1920869
Source: Provenance in Databases: Past, Current, Future. Peter Buneman (University of Edinburgh) and Wang-Chiew Tan (UC Santa Cruz). SIGMOD Tutorial 2007.
Talk Outline
• Provenance, why? (in science)
• Provenance, of Scientific Data: The DataONE Federation of Data
Repositories (dataone.org)
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• (Provenance for streaming data analytics)
Provenance from analytics
Analytics data processing generates potentially valuable knowledge. But credibility of the outcomes requires:
- understandability of the data processing — human-oriented: “prospective provenance”
- provenance recording and query — machine-oriented: PROV → XML, RDF, Neo4j data + query models
Problem: process complexity and lack of transparency → provenance is either coarse-grained or too complex to understand
Some research prototypes
Titian (*)
Apache Spark maintains the program transformation lineage to recover from failures. Titian enhances the Spark RDD programming model with:
• data provenance capture
• interactive query support
(*) Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Condie, T. (2015). Titian: Data Provenance
Support in Spark. Proc. VLDB Endow., 9(3), 216–227. http://doi.org/10.14778/2850583.2850595
Lipstick on a pig (+)(++)
A framework that marries database-style and workflow provenance models.
The catch: all modules must be implemented in Pig Latin.
(+) Amsterdamer, Y., Davidson, S. B., Deutch, D., Milo, T., Stoyanovich, J., & Tannen, V. (2011). Putting lipstick on
pig: enabling database-style workflow provenance. Proc. VLDB Endow., 5(4), 346–357.
http://dl.acm.org/citation.cfm?id=2095686.2095693
(++) Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (2008). Pig latin: a not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data
(pp. 1099–1110). New York, NY, USA: ACM. http://doi.org/http://doi.acm.org/10.1145/1376616.1376726
Some research prototypes
RAMP (**) (Reduce And Map Provenance)
An extension to Hadoop that transparently wraps each function with
provenance capture
(**) Ikeda, R., Park, H., & Widom, J. Provenance for generalized map and reduce workflows. In: CIDR 2011.
Provenance for analytics: Titian
Apache Spark natively maintains the program transformation lineage so that it can
reconstruct lost RDD partitions in the case of a failure
• Titian enhances it with data provenance capture and interactive query support
that extends the Spark RDD programming model
• With limited overhead of less than 30%
[1] Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Condie, T. (2015). Titian: Data Provenance
Support in Spark. Proc. VLDB Endow., 9(3), 216–227. http://doi.org/10.14778/2850583.2850595
[Figures: LineageRDD methods for traversing the data lineage in both backward and forward directions [1]; the job workflow after adding the lineage capture points.]
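A minimal PySpark sketch of the native transformation lineage that Spark maintains (and that Titian extends with record-level data provenance); Titian's own LineageRDD API is not reproduced here, and the toy data is illustrative:

# Spark's native *transformation* lineage, visible via toDebugString()
from pyspark import SparkContext

sc = SparkContext("local[2]", "lineage-demo")

words  = sc.parallelize(["a b", "b c", "c d"]).flatMap(lambda line: line.split())
pairs  = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)

# The chain of transformations Spark would replay to rebuild lost partitions
lineage = counts.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)

print(counts.collect())
sc.stop()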
Provenance for analytics: “Lipstick on a pig”
[2] Amsterdamer, Y., Davidson, S. B., Deutch, D., Milo, T., Stoyanovich, J., & Tannen, V. (2011). Putting lipstick on
pig: enabling database-style workflow provenance. Proc. VLDB Endow., 5(4), 346–357.
http://dl.acm.org/citation.cfm?id=2095686.2095693
[3] Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (2008). Pig latin: a not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data
(pp. 1099–1110). New York, NY, USA: ACM. http://doi.org/http://doi.acm.org/10.1145/1376616.1376726
“A framework that marries database-style and workflow provenance models, capturing internal state as well as fine-grained dependencies in workflow provenance.”
The catch: all modules must be implemented in Pig Latin, “an emerging language that combines high-level declarative querying with low-level procedural programming and parallelization in the style of map-reduce” [3].
Provenance for analytics: Map-Reduce
[4] Ikeda, R., Park, H., & Widom, J. (2011). Provenance for generalized map and reduce workflows. In: CIDR 2011.
Scope: generalized map and reduce workflows (GMRWs)
• input data sets are processed by an acyclic graph of map and reduce functions
Transformations: Map, Reduce, Union, Split
• Each transformation has an associated provenance operator
RAMP (Reduce And Map Provenance) is an extension to Hadoop that transparently wraps each function with provenance capture.
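A minimal Python sketch of the wrapping idea (not RAMP's actual API): each map and reduce output carries the identifiers of the input records it was derived from:

# Wrapping map and reduce so that outputs carry provenance (input record ids)
from collections import defaultdict

def provenance_map(fn, records):
    """records: iterable of (record_id, value); fn: value -> (key, value)."""
    for rec_id, value in records:
        key, out = fn(value)
        yield key, out, {rec_id}                 # provenance = the single input id

def provenance_reduce(fn, mapped):
    """Group by key, reduce the values, and union the provenance sets."""
    groups, prov = defaultdict(list), defaultdict(set)
    for key, value, ids in mapped:
        groups[key].append(value)
        prov[key] |= ids
    for key, values in groups.items():
        yield key, fn(values), prov[key]

# Word count with provenance: which input lines contributed to each count?
lines = [("line1", "a b"), ("line2", "b c")]
words = ((lid, w) for lid, line in lines for w in line.split())
mapped = provenance_map(lambda w: (w, 1), words)
for key, total, ids in provenance_reduce(sum, mapped):
    print(key, total, sorted(ids))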
Provenance for analytics: Map-Reduce
[5] Crawl, D., Wang, J., & Altintas, I. (2011). Provenance for MapReduce-based data-intensive workflows. In
Proceedings of the 6th workshop on Workflows in support of large-scale science (pp. 21–30).
The Kepler+Hadoop framework works for Kepler workflows that invoke Hadoop jobs and captures provenance inside the MapReduce job.
Provenance for analytics: Map-Reduce
[6] Murray, D. G., McSherry, F., Isard, M., Isaacs, R., Barham, P., & Abadi, M. (2016). Incremental, iterative data
processing with timely dataflow. Communications of the ACM, 59(10), 75–83. http://doi.org/10.1145/2983551
Extends the MapReduce framework with change propagation:
• the framework keeps track of the dependencies between subsets of each MapReduce computation
• when a subset of the input changes, it rebuilds only the parts of the computation and of the output affected by the changes
Challenges
• Too little, or too much, provenance
• Not at the right level of abstraction:
  - Complex, black-box analytics → coarse-grained provenance
  - White-box analytics frameworks → fine-grained, too detailed provenance (abstraction / view mechanisms are needed!)
  - Ad hoc analytics (e.g. Python, R) → ??
We need flexible, user-level provenance capture from complex analytics.
Additional recent research on Provenance and Big Data
Chen, Peng; Plale, Beth A., “Big Data Provenance Analysis and Visualization,” Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pp. 797–800, 4–7 May 2015. doi: 10.1109/CCGrid.2015.85
Chen, Peng; Plale, Beth A., “ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model,” Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pp. 525–534, 4–7 May 2015. doi: 10.1109/CCGrid.2015.86
Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs
Peter Macko and Margo Seltzer, Harvard University, Procs. TAPP’11, 2011, Crete, Greece
Provenance from Log Files: a BigData Problem, Devarshi Ghoshal and
Beth Plale, Procs. BigProv workshop, EDBT, Genova, Italy, 2013
Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory
Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop,
Edinburgh, 2015
http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf
Talk Outline
• Provenance, why? (in science)
• Provenance of Scientific Data
• The DataONE Federation of Data Repositories (dataone.org)
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• (Provenance for streaming data analytics)
ReComp
Metadata analytics:
Provenance for selective re-computation of big data analytics
[Figure: Big Data flows through “The Big Analytics Machine” to produce successive versions V1, V2, V3 of “Valuable Knowledge” over time; the meta-knowledge involved — algorithms, tools, middleware and reference datasets — also evolves over time.]
Funded by the EPSRC on the Making Sense from Data call (2016 – 2019)
http://recomp.org.uk/
Example: NGS variant calling and clinical interpretation
Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
- e.g. the 100K Genome Project, Genomics England, GeCIP
[Figure: the NGS pipeline in three stages. Stage 1: align, clean, recalibrate alignments and calculate coverage over the raw sequences, producing coverage information. Stage 2: call variants, recalibrate variants and filter variants. Stage 3: annotate, producing the annotated variants.]
Also: metagenomics (species identification), e.g. the EBI metagenomics portal.
The workflow helps confirm or reject hypotheses about the patient's phenotype, classifying variants into three categories: RED (pathogenic), GREEN (benign) and AMBER (unknown/uncertain).
ReComp
• Record execution history (provides transparency), stored in a history DB
• Detect and measure changes, using data diff(.,.) functions over incoming change events
• Estimate the impact of changes (scoping, prioritisation)
• Enact on demand (partial re-run, differential execution)
Changes include: algorithms and tools, accuracy of input sequences, reference databases.
Goal: Minimise re-comp effort across all prior outcomes
Objective:
• Reduce the amount of computation performed in reaction to changes
Constraint: selective re-computation should be lossless
• All instances that may be subject to impact must be considered
• P1: Partial re-execution
• Reduce re-computation to only those parts of a process that are actually
involved in the processing of the changed data
• P2: Differential execution
• Insight: if an instance I of process P is executed using the delta between two versions of the inputs and it produces an empty result, then I is not affected by the version change
• Only feasible if some algebraic properties of the process hold
• P3: Identifying the scope of change
• Determine which instances out of a population of outcomes are going to be
affected by the change
SVI process: implementation using workflow
[Figure: the SVI process as an e-Science Central workflow — the patient's variants and a phenotype hypothesis feed three stages: phenotype-to-genes mapping (using GeneMap), variant selection, and variant classification (using ClinVar), producing the classified variants.]
The ProvONE provenance data model
Workflow structure
Retrospective provenance
P1: Partial re-execution
Objective: find the minimal sub-graph of a workflow that is affected by a change
Approach:
• e-SC generates one ProvONE provenance trace for each workflow run
• Use the traces to identify the minimal sub-workflow that is affected by the change

Initial query — given a new version d’ of a reference database d, find all activities A within an invocation I that used a prior version Dep of d’ (variable I denotes one execution instance, i.e. a workflow invocation):

:- invocation(I), wasPartOf(A,I),
   wasDerivedFrom(d’,Dep), used(A,Dep)

Traversal queries — collect all other activities connected to A through implicit data transfer:

:- execution(A1), execution(A2),
   wasPartOf(A,I), wasInformedBy(A2,A1)

and collect all other activities connected to A through explicit transfer of data D from A1 to A2:

:- execution(A1), execution(A2),
   wasGeneratedBy(Data,A1), used(A2,Data)
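A minimal Python sketch of the traversal idea over a toy provenance trace; the activity and data names are illustrative (loosely echoing the SVI stages), not the actual e-SC trace:

# Forward closure over a provenance trace: activities affected by a changed dependency
from collections import deque

used         = {"phenotypeToGenes": {"geneMap_v1"},
                "variantClassification": {"clinVar_v1", "selectedVariants"}}
generated_by = {"genesInScope": "phenotypeToGenes",
                "selectedVariants": "variantSelection",
                "classifiedVariants": "variantClassification"}
informs      = {"phenotypeToGenes": {"variantSelection"}}   # wasInformedBy, reversed

def affected_activities(changed_dep):
    seeds = {a for a, deps in used.items() if changed_dep in deps}
    todo, seen = deque(seeds), set(seeds)
    while todo:
        a = todo.popleft()
        # explicit data transfer: activities that used data generated by a
        downstream = {b for b, deps in used.items()
                      if any(generated_by.get(d) == a for d in deps)}
        # implicit transfer: wasInformedBy
        downstream |= informs.get(a, set())
        for b in downstream - seen:
            seen.add(b); todo.append(b)
    return seen

print(affected_activities("clinVar_v1"))   # {'variantClassification'}
print(affected_activities("geneMap_v1"))   # {'phenotypeToGenes', 'variantSelection', 'variantClassification'}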
Minimal sub-graphs in SVI
[Figure: the minimal sub-workflows affected by a change in ClinVar and by a change in GeneMap, highlighted on the SVI workflow.]
Partial re-execution following a change in only one of the databases requires caching the intermediate data at the boundary of the blue and red areas.
Results
• How much can we save? It depends on the process structure and on where the reference data is first used.
• Overhead: storing the interim data required for partial re-execution — 156 MB for GeneMap changes and 37 kB for ClinVar changes.

Time savings:
            Partial re-execution (sec)   Complete re-execution (sec)   Time saving (%)
  GeneMap   325                          455                           28.5
  ClinVar   287                          455                           37
P2: Differential execution
Suppose D is a relation (a table). diff_D(D1, D2) can then be expressed in terms of the sets of records added to and removed from D between the two versions. The idea is to compute the new output P(D2) as the combination of the previous output P(D1) and of P applied to those difference sets. This is effective if the difference sets are small relative to D2, and it can be achieved if the operators that make up P satisfy certain algebraic properties (*).
(*) Associative, distributive over set union and difference
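A minimal Python sketch of the idea for the special case where P is a simple selection over a set-valued relation (selections distribute over set union and difference), showing that the new output can be rebuilt from the old output plus the deltas:

# Differential execution for a selection over a set-valued relation
D1 = {("v1", "benign"), ("v2", "pathogenic"), ("v3", "uncertain")}
D2 = {("v1", "benign"), ("v2", "benign"), ("v4", "pathogenic")}    # new version

added, removed = D2 - D1, D1 - D2            # diff_D(D1, D2)

def P(records):                              # the "process": select pathogenic records
    return {r for r in records if r[1] == "pathogenic"}

full         = P(D2)                                  # recompute from scratch
differential = (P(D1) | P(added)) - P(removed)        # combine old output with the deltas

assert differential == full
print(differential)                          # {('v4', 'pathogenic')}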
P2: Partial re-computation using input difference
Insight: run SVI but replace the ClinVar query with a query on the diff between ClinVar versions:
Q(CV) → Q(diff(CV1, CV2))
This works for SVI, but is hard to generalise: it depends on the type of process.
Gain: diff(CV1, CV2) is much smaller than CV2.

GeneMap versions (from → to)   To-version rec. count   Difference rec. count   Reduction
  16-03-08 → 16-06-07          15910                   1458                    91%
  16-03-08 → 16-04-28          15871                   1386                    91%
  16-04-28 → 16-06-01          15897                   78                      99.5%
  16-06-01 → 16-06-02          15897                   2                       99.99%
  16-06-02 → 16-06-07          15910                   33                      99.8%

ClinVar versions (from → to)   To-version rec. count   Difference rec. count   Reduction
  15-02 → 16-05                290815                  38216                   87%
  15-02 → 16-02                285042                  35550                   88%
  16-02 → 16-05                290815                  3322                    98.9%
P3: Identifying the scope of change — a game of battleship
Patient / change impact matrix.
Challenge: precisely identify the scope of a change.
Blind reaction to change: recompute the entire matrix. Can we do better?
- Hit the high-impact cases (the X's) without re-computing the entire matrix
A scoping algorithm
Coarse-grained provenance indicates whether or not a dependency on D existed, but not which specific data from version Dt of D was used.
Candidate invocation: any invocation I of P whose provenance contains statements of the form
used(A,Dt), wasPartOf(A,I), wasAssociatedWith(I,_,P)
or
used(I,Dt), wasAssociatedWith(I,_,P)
Sketch of the algorithm (simplified; see the code sketch below):
- For each candidate invocation I of P:
  - compute the minimal subgraph P' of P that needs re-computation
  - partially re-execute P' using the difference sets as inputs, one step at a time,
    until <empty output> or <P' completed>
  - if <P' completed>, execute P' on the full inputs
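A minimal Python sketch of this loop; candidate_invocations, minimal_subgraph, run_step and run_full are hypothetical helpers standing in for the provenance query, the per-step partial re-execution over the difference sets, and the full re-execution:

# Simplified scoping loop (helpers are injected and purely illustrative)
def recompute_scope(P, changed_dep, diff_sets,
                    candidate_invocations, minimal_subgraph, run_step, run_full):
    recomputed = []
    for inv in candidate_invocations(P, changed_dep):       # provenance query
        subgraph = minimal_subgraph(P, inv, changed_dep)     # P': part of P to re-run
        affected = True
        for step in subgraph:                                # one step at a time,
            if not run_step(step, inv, diff_sets):           # fed with the diff sets
                affected = False                             # empty output: invocation
                break                                        # not affected, stop early
        if affected:                                         # P' completed on the diffs:
            recomputed.append(run_full(subgraph, inv))       # re-run it on the full inputs
    return recomputed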
Scoping: precision
• The approach avoids the majority of re-computations given a ClinVar change
• Reduction in number of complete re-executions from 495 down to 71
Scoping: efficiency
Total re-computation time for the whole patient cohort:
[Figure: execution time (hours, 0–5) against ClinVar update date (mm/yy), comparing four strategies: CV blind, CV selective process, CV selective scope with δ-gen, and CV selective scope with δ-SVI.]
We expect to pay a penalty for running the algorithm when the difference sets are
large compared to actual new data
More accurate diff functions result in higher runtime savings
• Smaller difference sets
• More precise scoping
Provenance in ReComp: Summary and Challenges
Objective:
• Reduce the amount of computation performed in reaction to changes
1. Partial re-execution of previously computed workflows
2. (Differential execution)
3. Identifying the scope of change
• Makes use of (2) to determine which instances of a population of outcomes are
going to be affected by a change
Challenges / work in progress:
• Validate and extend the approach to other case studies
• Design estimators to predict the impact of change
• Design and implement a generic ReComp meta-process
• Observe P in execution
• Detect changes
• Selectively react to changes
Talk Outline
• Provenance, why? (in science)
• Provenance, of what?
• Of Scientific Data: The DataONE Federation of Data Repositories (dataone.org)
• Of database data (very briefly)
• Of Web data  the W3C PROV data model for provenance
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• Provenance for streaming data analytics
Process-specific provenance using templates
Aim: enable provenance generation from black-box analytics.
Approach:
1) human-oriented: “prospective provenance” → YesWorkflow
2) machine-oriented: PROV → XML, RDF, Neo4j data + query models
[Figure: a process definition P: X → Y is paired with a provenance template T and a sidecar binding process PB: <xi, yi, P> → B; each execution xi → Pi → yi feeds PB, which produces a binding B, and Apply(B, T) yields a PROV document.]
The resulting provenance is:
• application-level
• user-specified
• coarse-grained
Illustration: the case of map()
[y1 … yn] = map(λx. f(x), [x1 … xn])
Process definition: map: <X, λx. f(x)> → Y
[Figure: the provenance template T consists of an activity :a associated with plan f, which used :x and generated :y.]
Execution: [y1 … yn] = map(λx. f(x), [x1 … xn])
The sidecar process PB produces the binding:
B = { <:x ← x1, :y ← y1, :a ← map1>,
      …
      <:x ← xn, :y ← yn, :a ← mapn> }
Apply(B, T) instantiates the template once per binding: for each i, an activity mapi (associated with plan f) used xi and generated yi.
PROV-N rendering of the map() case
[y1 … yn] = map(λx. f(x), [x1 … xn])

Provenance template T:
entity(f, [prov:type = 'prov:plan'])
entity(:x)
entity(:y)
activity(:a), wasAssociatedWith(:a,_,f)
used(:a,:x), wasGeneratedBy(:y,:a)
wasDerivedFrom(:y,:x)

Binding: B = { <:a ← gen(a, i), :x ← xi, :y ← yi> | i : 1 … n }

Prov = apply(B,T):
entity(f, [prov:type = 'prov:plan'])
activity(a_1), wasAssociatedWith(a_1,_,f)
…
activity(a_n), wasAssociatedWith(a_n,_,f)
entity(x_1), …, entity(x_n)
entity(y_1), …, entity(y_n)
used(a_1, x_1), wasGeneratedBy(y_1, a_1)
…
used(a_n, x_n), wasGeneratedBy(y_n, a_n)
wasDerivedFrom(y_1, x_1)
…
wasDerivedFrom(y_n, x_n)
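A minimal Python sketch of apply(B, T): expanding a small textual template over a list of bindings to produce the statements above (the template strings are illustrative, not a standard PROV-template serialisation):

# Expanding a provenance template over a set of bindings for map()
TEMPLATE = [
    "activity({a}), wasAssociatedWith({a},_,f)",
    "entity({x})",
    "entity({y})",
    "used({a}, {x}), wasGeneratedBy({y}, {a})",
    "wasDerivedFrom({y}, {x})",
]

def apply_template(bindings, template=TEMPLATE):
    statements = ["entity(f, [prov:type = 'prov:plan'])"]
    for b in bindings:                       # one instantiation per binding
        statements += [line.format(**b) for line in template]
    return statements

# B = { <:a <- map_i, :x <- x_i, :y <- y_i> | i = 1..n }
n = 3
B = [{"a": f"map_{i}", "x": f"x_{i}", "y": f"y_{i}"} for i in range(1, n + 1)]

print("\n".join(apply_template(B)))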
Application of template approach to streaming analytics
Data in movement is a prime source for value-added analytics applications,
e.g. data streams from Internet of Things devices.
The provenance of an output data stream is itself a stream of provenance statements.
The template / sidecar process / binding framework applies without changes. Following the Spark Streaming model:
- a stream is discretised into a sequence of micro-batch intervals
- a window W = [x1 … xk] is a user-defined sequence of intervals xi
- a process P: Wi → Y operates on one window at a time (Y may be a sequence or another multivariate data structure)
Provenance streams
A simple template: “each :y is generated (gen) by an activity :a that used the content of one window :w”.
Sidecar processes B^P_1, B^P_2, … are invoked alongside each invocation of P1, P2, …
Each B^P_i produces a binding Bi, and each Bi is applied to the template, producing a PROV document Provi. This results in a stream of provenance alongside the stream of outputs yi.
Execution:
wi = [xi1 … xin]
y1 = P1(w1) → B1 = B^P_1(w1, P1, y1) → Prov1 = apply(B1, T)
y2 = P2(w2) → B2 = B^P_2(w2, P2, y2) → Prov2 = apply(B2, T)
…
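A minimal Python sketch of the micro-batch loop, with an illustrative aggregate standing in for P and plain strings standing in for the PROV statements:

# Emitting a provenance record alongside each micro-batch output
from itertools import islice

def windows(stream, size):
    """Discretise a stream into fixed-size micro-batch windows."""
    it = iter(stream)
    while True:
        w = list(islice(it, size))
        if not w:
            return
        yield w

def P(window):                      # the analytics step: here, a simple aggregate
    return sum(window)

def sidecar_binding(i, window, y):  # B_i^P: records what invocation i used and produced
    return {"a": f"P_{i}", "w": f"window_{i} ({window})", "y": f"y_{i}={y}"}

def apply_T(b):                     # apply(B_i, T) for the one-window template
    return [f"activity({b['a']})", f"used({b['a']}, {b['w']})",
            f"wasGeneratedBy({b['y']}, {b['a']})"]

stream = range(12)
for i, w in enumerate(windows(stream, 4), start=1):
    y = P(w)
    prov_i = apply_T(sidecar_binding(i, w, y))
    print(y, prov_i)                # output stream and provenance stream, in step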
Extensions
The framework also applies to a stateful process P:
- P's outcome depends on an internal state S
- P's execution may modify S
Hint: S is itself an entity with provenance, defined by its own template and bindings.
Flexibility can be added by allowing multiple templates and multiple sidecar processes for the same process execution.
The actual provenance output at the end of the process, Prov = apply(B,T), is an extensional representation; <T, B> is an intensional representation, which is more space-efficient (T is stored only once and B is a set of variable–value pairs) and is only serialised when needed.
Summary
• Provenance, why? (in science)
• Provenance, of Scientific Data: The DataONE Federation of Data
Repositories (dataone.org)
• Provenance for Data Science
• Provenance-enabled data analytics frameworks
• Provenance in the ReComp project
• (Provenance for streaming data analytics)
Questions?
http://recomp.org.uk/
Selected bibliography
Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell,
et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012.
http://www.w3.org/TR/prov-dm/
Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012.
http://www.w3.org/TR/prov-constraints/
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web
Semantics: Science, Services and Agents on the World Wide Web (April 2015).
doi:10.1016/j.websem.2015.04.001.
http://www.sciencedirect.com/science/article/pii/S1570826815000177
Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz,
Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific
Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530.
http://dx.doi.org/10.1002/cpe.1870.
ProvGen: generating synthetic PROV graphs with predictable structure.
Firth, H.; and Missier, P. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer
http://arxiv.org/pdf/1406.2495
ProvAbs: model, policy, and tooling for abstracting PROV graphs.
Missier, P.; Bryans, J.; Gamble, C.; Curcin, V.; and Danger, R. In Procs. IPAW 2014 (Provenance and
Annotations), Koln, Germany, 2014. Springer http://arxiv.org/pdf/1406.1998
De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh, Scotland: USENIX Association, 2015. https://www.usenix.org/conference/tapp15/workshop-program/presentation/de-oliveira.
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Data Provenance and its role in Data Science

  • 1. DataScienceWorkshop Islamabad,April2017 P.Missier 1 Data Provenance and its role in Data Science Dr. Paolo Missier School of Computing Science Newcastle University, UK Data Science Workshop Islamabad, April 2017
  • 2. DataScienceWorkshop Islamabad,April2017 P.Missier 2 What is provenance? Oxford English Dictionary: • the fact of coming from some particular source or quarter; origin, derivation • the history or pedigree of a work of art, manuscript, rare book, etc.; • a record of the passage of an item through its various owners: chain of custody Magna Carta (‘the Great Charter’) was agreed between King John and his barons on 15 June 1215.
  • 3. DataScienceWorkshop Islamabad,April2017 P.Missier 3 A PROV provenance graph 3 Editing phase drafting commenting editingused draft v1 wasGeneratedBy used draft comments wasGeneratedBy used draft v2 wasGeneratedBy BobBob-1 Bob-2 specializationOf wasAssociatedWith specializationOf wasAssociatedWith reading wasDerivedFrom paper3 used Alice wasAssociatedWith actedOnBehalfOf wasDerivedFrom Remote past Recent past wasGeneratedBy distribution=internal status=draft version=0.1 distribution=internal status=draft version=0.1 type=person role=main_editortype=person role=jr_editor role=author role=editor role=author wasAttributedTo Publishing phase guideline update publication draft v2 used WD1 pub guidelines v1 wasGeneratedBy pub guidelines v2 wasGeneratedBy wasDerivedFrom Charlie wasAssociatedWith Alice actedOnBehalfOf w3c: consortium wasAssociatedWith distribution=public status=draft version=1.0 type=person role=headOfPublication type=institution role=issuer 3 Editing phase drafting commenting editingused draft v1 wasGeneratedBy used draft comments wasGeneratedBy used draft v2 wasGeneratedBy BobBob-1 Bob-2 specializationOf wasAssociatedWith specializationOf wasAssociatedWith reading wasDerivedFrom paper3 used Alice wasAssociatedWith actedOnBehalfOf wasDerivedFrom Remote past Recent past wasGeneratedBy distribution=internal status=draft version=0.1 distribution=internal status=draft version=0.1 type=person role=main_editortype=person role=jr_editor role=author role=editor role=author wasAttributedTo Publishing phase guideline update publication draft v2 used WD1 pub guidelines v1 wasGeneratedBy pub guidelines v2 wasGeneratedBy wasDerivedFrom Charlie wasAssociatedWith Alice actedOnBehalfOf w3c: consortium wasAssociatedWith distribution=public status=draft version=1.0 type=person role=headOfPublication type=institution role=issuer
  • 4. DataScienceWorkshop Islamabad,April2017 P.Missier 4 The W3C Working Group on Provenance W3C Incubator group on provenance Chair: Yolanda Gil, ISI, USC W3C working group approved Chairs: Luc Moreau, Paul Groth 2009-2010 Main output: “Provenance XG Final Report” http://www.w3.org/2005/Incubator/prov/XGR-prov/ - provides an overview of the various existing approaches, vocabularies - proposes the creation of a dedicated W3C Working Group April, 2011 April, 2013 Proposed Recommendations finalised prov-dm: Data Model prov-o: OWL ontology, RDF encoding prov-n: prov notation prov-constraints ...plus a number of non-prescriptive Notes http://www.w3.org/2011/prov/wiki/
  • 5. DataScienceWorkshop Islamabad,April2017 P.Missier 5 PROV: scope and structure source: http://www.w3.org/TR/prov-overview/ Recommendation track See also: Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures on the Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129. doi:10.2200/S00528ED1V01Y201308WBE007.
  • 6. DataScienceWorkshop Islamabad,April2017 P.Missier 6 PROV notation (PROV-N) document prefix prov <http://www.w3.org/ns/prov#> prefix ex <http://www.example.com/> entity(ex:draftComments) entity(ex:draftV1, [ ex:distr='internal', ex:status = "draft"]) entity(ex:paper1) entity(ex:paper2) activity(ex:commenting) activity(ex:drafting) wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00) used(ex:commenting, ex:draftV1, -) wasGeneratedBy(ex:draftV1, ex:drafting, -) used(ex:drafting, ex:paper1, -) used(ex:drafting, ex:paper2, -) endDocument
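The PROV-N fragment on slide 6 can also be produced programmatically. A minimal sketch using the Python prov package (pip install prov); the package is not mentioned in the slides, so this is purely illustrative, with identifiers mirroring the slide's ex: namespace:

    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('ex', 'http://www.example.com/')

    doc.entity('ex:draftComments')
    doc.entity('ex:draftV1', {'ex:distr': 'internal', 'ex:status': 'draft'})
    doc.entity('ex:paper1')
    doc.entity('ex:paper2')
    doc.activity('ex:commenting')
    doc.activity('ex:drafting')
    doc.wasGeneratedBy('ex:draftComments', 'ex:commenting')
    doc.used('ex:commenting', 'ex:draftV1')
    doc.wasGeneratedBy('ex:draftV1', 'ex:drafting')
    doc.used('ex:drafting', 'ex:paper1')
    doc.used('ex:drafting', 'ex:paper2')

    # serialises the document in PROV-N, much like the slide above
    print(doc.get_provn())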
  • 7. DataScienceWorkshop Islamabad,April2017 P.Missier 7 Same example — PROV-O notation :draftComments a prov:Entity ; :distr "internal"^^xsd:string ; prov:wasGeneratedBy :commenting . :commenting a prov:Activity ; prov:used :draftV1 . :draftV1 a prov:Entity ; :distr "internal"^^xsd:string ; :status "draft"^^xsd:string ; :version "0.1"^^xsd:string ; prov:wasGeneratedBy :drafting . :drafting a prov:Activity ; prov:used :paper1, :paper2 . :paper1 a prov:Entity, "reference"^^xsd:string . :paper2 a prov:Entity, "reference"^^xsd:string . (RDF / Turtle notation)
  • 8. DataScienceWorkshop Islamabad,April2017 P.Missier 8 Association, Attribution, Delegation: who did what? An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. Attribution is the ascribing of an entity to an agent. entity(ex:draftComments, [ ex:distr='internal' ]) activity(ex:commenting) agent(ex:Bob, [prov:type = "mainEditor"] ) agent(ex:Alice, [prov:type = "srEditor"]) wasAssociatedWith(ex:commenting, Bob, -, [prov:role = "editor"]) actedOnBehalfOf(Bob, Alice) wasAttributedTo(ex:draftComments, ex:Bob)
  • 9. DataScienceWorkshop Islamabad,April2017 P.Missier 9 Same example — PROV-O notation (RDF/N3) :Alice a prov:Agent, "ex:chiefEditor"; :firstName "Alice"; :lastName "Cooper". :Bob a prov:Agent, "ex:seniorEditor"; :firstName "Robert"; :lastName "Thompson"; prov:actedOnBehalfOf :Alice . :draftComments prov:wasAttributedTo :Bob . :drafting a prov:Activity ; prov:wasAssociatedWith :Bob .
  • 10. DataScienceWorkshop Islamabad,April2017 P.Missier 10 Association and Attribution Q.: what is the relationship between attribution and association? This is defined as an inference rule in the PROV-CONSTR document entity(e) agent(Ag) activity(a) wasAttributedTo(e, Ag) wasGeneratedBy(e, a,-) wasAssociatedWith(a, Ag,-)
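The inference rule on slide 10 can be executed mechanically. A small, self-contained Python sketch (illustrative only) that derives wasAttributedTo facts from wasGeneratedBy and wasAssociatedWith facts:

    # Derive wasAttributedTo(e, ag) from wasGeneratedBy(e, a) and
    # wasAssociatedWith(a, ag), mirroring the PROV-CONSTRAINTS rule above.
    was_generated_by = {("ex:draftComments", "ex:commenting")}    # (entity, activity)
    was_associated_with = {("ex:commenting", "ex:Bob")}           # (activity, agent)

    def infer_attribution(gen, assoc):
        """Return the set of (entity, agent) pairs implied by the rule."""
        return {(e, ag) for (e, a1) in gen for (a2, ag) in assoc if a1 == a2}

    print(infer_attribution(was_generated_by, was_associated_with))
    # {('ex:draftComments', 'ex:Bob')}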
  • 12. DataScienceWorkshop Islamabad,April2017 P.Missier 12 Derivation amongst entities A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity. entity(ex:draftV1) entity(ex:draftComments) wasDerivedFrom(ex:draftComments, ex:draftV1) Q.: what is the relationship between derivation, generation, and usage? :draftComments a prov:Entity ; prov:wasDerivedFrom :draftV1 . :draftV1 a prov:Entity .
  • 13. DataScienceWorkshop Islamabad,April2017 P.Missier 13 From “scruffy” provenance to “valid” provenance - Are all possible temporal partial ordering of events equally acceptable? - How can we specify the set of all valid orderings? - how do we formally define what it means for a set of provenance statements to be valid? PROV defines a set of temporal constraints that ensure consistency of a provenance graph
  • 14. DataScienceWorkshop Islamabad,April2017 P.Missier 14 Talk Outline • Provenance, why? (in science) • Provenance of Scientific Data • The DataONE Federation of Data Repositories (dataone.org) • Provenance for Data Science • Provenance-enabled data analytics frameworks • Provenance in the ReComp project • (Provenance for streaming data analytics)
  • 15. DataScienceWorkshop Islamabad,April2017 P.Missier 15 Talk Outline • Provenance, why? (in science) • Provenance of Scientific Data • The DataONE Federation of Data Repositories (dataone.org) • Provenance for Data Science • Provenance-enabled data analytics frameworks • Provenance in the ReComp project • Provenance for streaming data analytics
  • 16. DataScienceWorkshop Islamabad,April2017 P.Missier 16 Why provenance? Provenance in machine learning: • Why is my predictive algorithm recommending these new friends to me? • How can I trust my classifier’s predictions? [1] Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., Hage, W. R. Van, & Nottamkandath, A. (2016). Combining User Reputation and Provenance Analysis for Trust Assessment. J. Data and Information Quality, 7(1–2), 6:1--6:28. http://doi.org/10.1145/2818382 • Reproducibility of your own and your peers’ work • i.e. in experimental Science Example: assessing trust in Web artifacts and crowdsourced annotations [1] • Communication: To engender trust in the data and amongst the people and systems that are responsible for it • Understandability: • to explain the outcome of a complex decision process
  • 17. DataScienceWorkshop Islamabad,April2017 P.Missier 17 Trusted Web data: Provenance on the Web Tim Berners-Lee’s “Oh Yeah” button: http://users.ugent.be/~tdenies/OhYeah/ Easy Access to Provenance: an Essential Step Towards Trust on the Web, Procs METHOD 2013: The 2nd IEEE International Workshop on Methods for Establishing Trust with Open Data Held in conjunction with COMPSAC, the IEEE Signature Conference on Computers, Software & Applications - July 22-26, 2013 - Kyoto, Japan http://dx.doi.org/10.1109/COMPSACW.2013.29
  • 18. DataScienceWorkshop Islamabad,April2017 P.Missier 18 Understandability: explaining process outcomes • Which process was used to derive a diagnosis? • How did the process use the input data? • How were the steps configured? • Which decisions were made by human experts (clinicians)? [Figure: the clinical diagnosis pipeline for genetic diseases, fed by the NGS pipeline: variant filtering (MAF threshold, non-synonymous, stop/gain, frameshift, known polymorphisms, homo/heterozygous, pathogenicity predictors), variant scoping (user-supplied disease keywords, gene lists and preferred genes; HPO to OMIM; OMIM to Gene; gene union and intersection; select variants in scope), ClinVar lookup, and variant classification into RED (found, pathogenic), GREEN (found, benign) and AMBER (not found or uncertain)] Example provenance query: “Find all invocations that used a specific version of ClinVar and OMIM, and group them by phenotype”
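The example provenance query on slide 18 could be expressed as SPARQL over a PROV-O trace. A hedged sketch using rdflib; the file name trace.ttl and the ex:database, ex:version and ex:phenotype annotation properties are hypothetical (they are not part of PROV, nor are they claimed to match the SVI implementation):

    from rdflib import Graph

    g = Graph()
    g.parse("trace.ttl", format="turtle")   # hypothetical PROV-O trace file

    query = """
    PREFIX prov: <http://www.w3.org/ns/prov#>
    PREFIX ex:   <http://www.example.com/>
    SELECT ?phenotype ?invocation WHERE {
      ?invocation a prov:Activity ;
                  prov:used ?clinvar, ?omim ;
                  ex:phenotype ?phenotype .
      ?clinvar ex:database "ClinVar" ; ex:version "16-05" .
      ?omim    ex:database "OMIM"    ; ex:version "16-06-07" .
    }
    ORDER BY ?phenotype
    """
    # ordering by phenotype makes client-side grouping straightforward
    for phenotype, invocation in g.query(query):
        print(phenotype, invocation)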
  • 19. DataScienceWorkshop Islamabad,April2017 P.Missier 19 Reproducibility and dissemination in Science Experimental science is data-intensive Independent validation of results claims is a cornerstone of scientific discourse Provenance is the equivalent of a formal logbook • Capture all steps involved in the derivation of a result • How much detail? • Replay, validate, compare
  • 20. DataScienceWorkshop Islamabad,April2017 P.Missier 20 Lifecycle of experimental datasets [Figure: lifecycle diagram: compute D in environment Env; package and publish the bundle <D, P, dep, spec(P), prov(D)>; search and discover it; deploy P’ into Env(dep’); re-compute to obtain D’ and prov(D’); Compare(P, P’, D, D’). Data, process and dependencies may all change in the meantime: D → D1, P → P’, dep → dep’]
  • 21. DataScienceWorkshop Islamabad,April2017 P.Missier 21 Reproducibility: working. reporting submit article and move on… publish article Research Environment Publication Environment Peer Review
  • 22. DataScienceWorkshop Islamabad,April2017 P.Missier 22 Re-what? Re-* ReRun: vary experiment and setup, same lab (P → P’, D → D’, dep → dep’) Repeat: same experiment, setup, lab (P, D, dep, env(dep)) Replicate: same experiment, setup, different lab (P, D, dep, env’(dep)) Reproduce: vary experiment and setup, different lab (P → P’, D → D’, dep → dep’, env(dep) → env’(dep’)) Reuse: different experiment (D, P → Q)
  • 23. DataScienceWorkshop Islamabad,April2017 P.Missier 23 Lifecycle with tools annotations [Figure: the same lifecycle diagram (compute, publish, package, search, discover, deploy, compare; D → D1, P → P’, dep → dep’), annotated with supporting tools: Research Objects; DataONE Federated Research Data Repositories; TOSCA-based virtualisation; PDIFF (differencing provenance); YesWorkflow and noWorkflow for workflow/script provenance; a Matlab provenance recorder (DataONE); ReproZip]
  • 24. DataScienceWorkshop Islamabad,April2017 P.Missier 24 References Research Objects: www.researchobject.org Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:doi:10.1016/j.future.2011.08.004. DataONE: dataone.org Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.” In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014. doi:10.2218/ijdc.v9i2.332. Process Virtualisation using TOSCA Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015. doi:10.1109/CLOUD.2015.146. NoWorkflow: provenance recording for Python Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. “noWorkflow: Capturing and Analyzing Provenance of Scripts⋆.” In Procs. IPAW’14. Cologne, Germany: Springer, 2014. YesWorkflow: Qian Zhang, Yang Cao, Qiwen Wang, Duc Vu, Priyaa Thavasimani, Timothy McPhillips, Paolo Missier, Bertram Ludäscher, Revealing the Detailed Lineage of Script Outputs using Hybrid Provenance, Procs. International Data Curation Conference, DCC, Edinburgh, 2017.
  • 25. DataScienceWorkshop Islamabad,April2017 P.Missier 25 PDIFF: comparing provenance traces. A graph obtained as the result of a traces “diff”, which can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions; this is the simplest possible delta “graph”! [Figure: (i) Trace A, (ii) Trace B, (iii) the resulting delta tree] Two executions of the same workflow, with slight differences: - Unintentional changes: e.g. incorrect porting/re-deployment → cause analysis - Intentional changes: e.g. different parameter settings → impact analysis Missier, P., Woodman, S., Hiden, H., & Watson, P. (2013). Provenance and data differencing for workflow reproducibility analysis. Concurrency and Computation: Practice and Experience, 28(4), 995–1015. http://doi.org/10.1002/cpe.3035
  • 26. DataScienceWorkshop Islamabad,April2017 P.Missier 26 Components for a flexible, scalable, sustainable network DataONE: Cyberinfrastructure www.dataone.org/member-nodes Coordinating Nodes • retain complete metadata catalog • indexing for search • network-wide services • ensure content availability (preservation) • replication services Member Nodes • diverse institutions • serve local community • provide resources for managing their data • retain copies of data
  • 27. DataScienceWorkshop Islamabad,April2017 P.Missier 27 Cyberinfrastructure [Figure: DataONE data services: extraction, sub-setting, etc.; ontology annotation; System Metadata; Science Data; Science Metadata; Provenance; Search API; Replicate; Metadata Index]
  • 29. DataScienceWorkshop Islamabad,April2017 P.Missier 29 ProvONE: extending PROV with process structure https://purl.dataone.org/provone-v1-dev Yang Cao, Christopher Jones, Vıctor Cuevas-Vicenttın, Matthew B. Jones, Bertram Ludascher, Timothy McPhillips, Paolo Missier, Christopher Schwalm, Peter Slaughter, Dave Vieglais, Lauren Walker, Yaxing Wei, ProvONE: extending PROV to support the DataONE scientific community, TAPP workshop on Theory and Practice of Provenance, 2016. Workflow structure Retrospective provenance
  • 31. DataScienceWorkshop Islamabad,April2017 P.Missier 31 Database provenance • Why is record R included in the result of a query? [why-provenance] (*) • Why is record R not in the result? [why-not provenance] (+) Example of data provenance (from the SIGMOD 2007 tutorial): a typical question is, for a given database query Q, a database D, and a tuple t in the output of Q(D), which parts of D “contribute” to t? The provenance of tuple (John, D01, Mary) in the output consists of the source facts R(John, D01) and S(D01, Mary) according to the query Q. The question could also be applied to an attribute value, a table, or any subtree in hierarchical/tree-like data.
    R:  Emp    Dept        S:  Did   Mgr        Q:  Emp    Dept   Mgr
        John   D01             D01   Mary           John   D01    Mary
        Susan  D02             D02   Ken            Susan  D02    Ken
        Anna   D04             D03   Ed
    Q = select r.A, r.B, s.C from R r, S s where r.B = s.B
(*) Cheney, J., Chiticariu, L., & Tan, W.-C. (2009). Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases, 1, 379–474. (+) Herschel, M., & Hernández, M. A. (2010). Explaining missing answers to SPJUA queries. Proceedings of the VLDB Endowment, 3(1–2), 185–196. http://doi.org/10.14778/1920841.1920869 Source: Provenance in Databases: Past, Current, Future. Peter Buneman, University of Edinburgh; Wang-Chiew Tan, UC Santa Cruz. SIGMOD Tutorial 2007.
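Why-provenance for the join query on slide 31 can be illustrated in a few lines of plain Python: for every output tuple, record the set of source facts (its witness) that produced it. A toy sketch, not a database implementation:

    # For each output tuple of Q = R join S (on department), record which
    # source facts in R and S contribute to it (its why-provenance witness).
    R = [("John", "D01"), ("Susan", "D02"), ("Anna", "D04")]   # (Emp, Dept)
    S = [("D01", "Mary"), ("D02", "Ken"), ("D03", "Ed")]       # (Did, Mgr)

    def join_with_why_provenance(r, s):
        """Return {output_tuple: set of contributing source facts}."""
        provenance = {}
        for emp, dept in r:
            for did, mgr in s:
                if dept == did:
                    out = (emp, dept, mgr)
                    provenance.setdefault(out, set()).update(
                        {("R", emp, dept), ("S", did, mgr)})
        return provenance

    for out, why in join_with_why_provenance(R, S).items():
        print(out, "<-", why)
    # ('John', 'D01', 'Mary') <- {('R', 'John', 'D01'), ('S', 'D01', 'Mary')}
    # ('Susan', 'D02', 'Ken') <- {('R', 'Susan', 'D02'), ('S', 'D02', 'Ken')}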
  • 32. DataScienceWorkshop Islamabad,April2017 P.Missier 32 Talk Outline • Provenance, why? (in science) • Provenance, of Scientific Data: The DataONE Federation of Data Repositories (dataone.org) • Provenance for Data Science • Provenance-enabled data analytics frameworks • Provenance in the ReComp project • (Provenance for streaming data analytics)
  • 33. DataScienceWorkshop Islamabad,April2017 P.Missier 33 Provenance from analytics Analytics data processing generates potentially valuable knowledge. But credibility of the outcomes requires: - understandability of data processing (human-oriented: "prospective provenance") - provenance recording and query (machine-oriented: PROV → XML, RDF, Neo4J data + query models) Problem: process complexity and lack of transparency → provenance is either coarse-grained or too complex to understand
  • 34. DataScienceWorkshop Islamabad,April2017 P.Missier 34 Some research prototypes Titian (*) Apache Spark maintains the program transformation lineage to recover from failures • Titian enhances the Spark RDD programming model • data provenance capture • interactive query support (*) Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Condie, T. (2015). Titian: Data Provenance Support in Spark. Proc. VLDB Endow., 9(3), 216–227. http://doi.org/10.14778/2850583.2850595 Lipstick on a pig (+)(++) A framework that marries database- style and workflow provenance models The catch… all modules must be implemented in Pig Latin (+) Amsterdamer, Y., Davidson, S. B., Deutch, D., Milo, T., Stoyanovich, J., & Tannen, V. (2011). Putting lipstick on pig: enabling database-style workflow provenance. Proc. VLDB Endow., 5(4), 346–357. http://dl.acm.org/citation.cfm?id=2095686.2095693 (++) Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (2008). Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (pp. 1099–1110). New York, NY, USA: ACM. http://doi.org/http://doi.acm.org/10.1145/1376616.1376726
  • 35. DataScienceWorkshop Islamabad,April2017 P.Missier 35 Some research prototypes RAMP (**) (Reduce And Map Provenance) An extension to Hadoop that transparently wraps each function with provenance capture (**) Ikeda, R., Park, H., & Widom, J. Provenance for generalized map and reduce workflows. In: CIDR 2011.
  • 36. DataScienceWorkshop Islamabad,April2017 P.Missier 36 Provenance for analytics: Titian Apache Spark natively maintains the program transformation lineage so that it can reconstruct lost RDD partitions in the case of a failure • Titian enhances it with data provenance capture and interactive query support that extends the Spark RDD programming model • With limited overhead of less than 30% [1] Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Condie, T. (2015). Titian: Data Provenance Support in Spark. Proc. VLDB Endow., 9(3), 216–227. http://doi.org/10.14778/2850583.2850595 LineageRDD methods for traversing through the data lineage in both backward and forward directions [1] Job workflow after adding the lineage capture points
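Titian itself is a Scala research extension of Spark, so it is not reproduced here; what can be shown out of the box is the transformation lineage that Spark already records for fault recovery, which Titian refines into record-level data provenance with interactive queries. A minimal PySpark sketch (assumes a local Spark installation):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "lineage-demo")
    words = sc.parallelize(["prov", "lineage", "spark", "prov"])
    counts = (words.map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .filter(lambda kv: kv[1] > 1))

    # toDebugString() prints the chain of RDD transformations (the lineage
    # graph) that Spark would replay to rebuild a lost partition.
    lineage = counts.toDebugString()
    print(lineage.decode() if isinstance(lineage, bytes) else lineage)
    sc.stop()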
  • 37. DataScienceWorkshop Islamabad,April2017 P.Missier 37 Provenance for analytics: “Lipstick on a pig” [2] Amsterdamer, Y., Davidson, S. B., Deutch, D., Milo, T., Stoyanovich, J., & Tannen, V. (2011). Putting lipstick on pig: enabling database-style workflow provenance. Proc. VLDB Endow., 5(4), 346–357. http://dl.acm.org/citation.cfm?id=2095686.2095693 [3] Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A. (2008). Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (pp. 1099–1110). New York, NY, USA: ACM. http://doi.org/http://doi.acm.org/10.1145/1376616.1376726 “A framework that marries database-style and workflow provenance models capturing internal state as well as fine-grained dependencies in workflow provenance” The catch… all modules must be implemented in Pig Latin: “an emerging language that combines high-level declarative querying with low-level procedural programming and parallelization in the style of map-reduce” [3]
  • 38. DataScienceWorkshop Islamabad,April2017 P.Missier 38 Provenance for analytics: Map-Reduce [4] Ikeda, R., Park, H., & Widom, J. (2011). Provenance for generalized map and reduce workflows. In: CIDR 2011. Scope: generalized map and reduce workflows (GMRWs) • input data sets are processed by an acyclic graph of map and reduce functions Transformations: Map, Reduce, Union, Split • Each transformation has an associated provenance operator RAMP (Reduce And Map Provenance), an extension to Hadoop that transparently wraps each function with provenance capture:
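The RAMP idea of transparently wrapping the map and reduce functions with provenance capture can be sketched in plain Python. This mimics the wrapping pattern only; it is not the RAMP/Hadoop implementation:

    from collections import defaultdict

    def provenance_mapreduce(records, map_fn, reduce_fn):
        """records: list of (key, value); map_fn(k, v) -> [(k2, v2)];
        reduce_fn(k2, [v2]) -> result. Returns (output, provenance)."""
        groups, provenance = defaultdict(list), defaultdict(set)
        for rec_id, (k, v) in enumerate(records):
            for k2, v2 in map_fn(k, v):                 # wrapped map step
                groups[k2].append(v2)
                provenance[k2].add(rec_id)              # output key <- input record
        output = {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}
        return output, dict(provenance)

    records = [(None, "a b a"), (None, "b c")]
    out, prov = provenance_mapreduce(
        records,
        map_fn=lambda k, line: [(w, 1) for w in line.split()],
        reduce_fn=lambda w, ones: sum(ones))
    print(out)   # {'a': 2, 'b': 2, 'c': 1}
    print(prov)  # {'a': {0}, 'b': {0, 1}, 'c': {1}}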
  • 39. DataScienceWorkshop Islamabad,April2017 P.Missier 39 Provenance for analytics: Map-Reduce [5] Crawl, D., Wang, J., & Altintas, I. (2011). Provenance for MapReduce-based data-intensive workflows. In Proceedings of the 6th workshop on Workflows in support of large-scale science (pp. 21–30). Kepler+Hadoop framework Works for Kepler workflows that invoke Hadoop jobs Capture provenance inside the MapReduce job as
  • 40. DataScienceWorkshop Islamabad,April2017 P.Missier 40 Provenance for analytics: Map-Reduce [6] Murray, D. G., McSherry, F., Isard, M., Isaacs, R., Barham, P., & Abadi, M. (2016). Incremental, iterative data processing with timely dataflow. Communications of the ACM, 59(10), 75–83. http://doi.org/10.1145/2983551 Extend the MapReduce framework with change propagation • The framework keeps track of the dependencies between subsets of each MapReduce computation • When a subset of the input changes, rebuilds only the parts of the computation and the output affected by the changes
  • 41. DataScienceWorkshop Islamabad,April2017 P.Missier 41 Challenges • Too little, or too much, provenance • Not at the right level of abstraction Complex, black-box analytics → coarse-grained provenance White-box analytics frameworks → fine-grained, but too detailed • need abstraction / view mechanisms! Ad hoc analytics (e.g. Python, R) → ?? Need flexible, user-level provenance capture from complex analytics
  • 42. DataScienceWorkshop Islamabad,April2017 P.Missier 42 Additional recent research on Provenance and Big Data Chen, Peng; Plale, Beth A., "Big Data Provenance Analysis and Visualization," Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on , vol., no., pp.797,800, 4-7 May 2015 doi: 10.1109/CCGrid.2015.85 Chen, Peng; Plale, Beth A., "ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model," Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on , vol., no., pp.525,534, 4-7 May 2015 doi: 10.1109/CCGrid.2015.86 Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs Peter Macko and Margo Seltzer, Harvard University, Procs. TAPP’11, 2011, Crete, Greece Provenance from Log Files: a BigData Problem, Devarshi Ghoshal and Beth Plale, Procs. BigProv workshop, EDBT, Genova, Italy, 2013 Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop, Edinburgh, 2015 http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf
  • 43. DataScienceWorkshop Islamabad,April2017 P.Missier 43 Talk Outline • Provenance, why? (in science) • Provenance of Scientific Data • The DataONE Federation of Data Repositories (dataone.org) • Provenance for Data Science • Provenance-enabled data analytics frameworks • Provenance in the ReComp project • (Provenance for streaming data analytics)
  • 44. DataScienceWorkshop Islamabad,April2017 P.Missier 44 ReComp Metadata analytics: Provenance for selective re-computation of big data analytics Big Data The Big Analytics Machine “Valuable Knowledge” V3 V2 V1 Meta-knowledge Algorithms Tools Middleware Reference datasets t t t Funded by the EPSRC on the Making Sense from Data call (2016 – 2019) http://recomp.org.uk/
  • 45. DataScienceWorkshop Islamabad,April2017 P.Missier 45 Example: NGS variant calling and clinical interpretation Genomics: WES / WGS, variant calling, variant interpretation → diagnosis - e.g. the 100K Genome Project, Genomics England, GeCIP [Figure: three-stage NGS pipeline: Stage 1: align, clean, recalibrate alignments, calculate coverage (raw sequences → coverage information); Stage 2: call variants, recalibrate variants, filter variants; Stage 3: annotate → annotated variants] Also: metagenomics, species identification, e.g. the EBI metagenomics portal. The workflow helps confirm/reject hypotheses about the patient’s phenotype and classifies variants into three categories: RED, GREEN, AMBER (pathogenic, benign and unknown/uncertain)
  • 46. DataScienceWorkshop Islamabad,April2017 P.Missier 46 ReComp Record execution history • transparency Detect and measure changes Estimate impact of changes • Scoping • Prioritisation Enact on demand • Partial re-run • Differential execution History DB Data diff(.,.) functions Changes: • Algorithms and tools • Accuracy of input sequences • Reference databases Change Events
  • 47. DataScienceWorkshop Islamabad,April2017 P.Missier 47 Goal: Minimise re-comp effort across all prior outcomes Objective: • Reduce the amount of computation performed in reaction to changes Constraint: selective re-computation should be lossless • All instances that may be subject to impact must be considered • P1: Partial re-execution • Reduce re-computation to only those parts of a process that are actually involved in the processing of the changed data • P2: Differential execution • Insight: If an instance I of process P is executed using the delta between two versions of the inputs and it produces empty result, then I is not affected by the version change • Only feasible if some algebraic properties of the process hold • P3: Identifying the scope of change • Determine which instances out of a population of outcomes are going to be affected by the change
  • 48. DataScienceWorkshop Islamabad,April2017 P.Missier 48 SVI process: implementation using workflow Phenotype to genes Variant selection Variant classification Patient variants GeneMap ClinVar Classified variants Phenotype hypothesis e-Science Central WFMS
  • 49. DataScienceWorkshop Islamabad,April2017 P.Missier 49 The ProvONE provenance data model Workflow structure Retrospective provenance
  • 50. DataScienceWorkshop Islamabad,April2017 P.Missier 50 P1: Partial re-execution Objective: find the minimal sub-graph of a workflow that is affected by a change Approach: • e-SC generates one ProvONE provenance trace for each workflow run • Use traces to identify the minimal sub-workflow that is affected by the change :- invocation(I), wasPartOf(A,I), wasDerivedFrom(d’,Dep),used(A,Dep) Initial query: given a new version d’ of a reference database d: Variable I denotes one execution instance (workflow invocation) Traversal queries: :- execution(A1), execution(A2), wasPartOf(A,I), wasInformedBy(A2,A1) collect all other activities connected to A (implicit data transfer) :- execution(A1), execution(A2), wasGeneratedBy(Data,A1), used(A2,Data) collect all other activities connected to A (explicit data transfer D from A1 to A2) Find all activities A within invocation I, that used a prior version Dep of d’
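The traversal queries above can be read as a reachability computation over the provenance trace. A small Python sketch (illustrative, using a toy trace rather than the e-Science Central store) that collects the activities downstream of a changed reference database by following used and wasGeneratedBy edges (wasInformedBy could be handled analogously):

    def affected_activities(used, was_generated_by, changed_entity):
        """used: {(activity, entity)}; was_generated_by: {(entity, activity)}."""
        frontier = {a for (a, e) in used if e == changed_entity}
        affected = set(frontier)
        while frontier:
            produced = {e for (e, a) in was_generated_by if a in frontier}
            frontier = {a for (a, e) in used if e in produced} - affected
            affected |= frontier
        return affected

    # Toy trace: ClinVar -> variant_classification -> classified_variants -> report_step
    used = {("variant_classification", "ClinVar_v1"),
            ("variant_selection", "GeneMap_v1"),
            ("report_step", "classified_variants")}
    was_generated_by = {("classified_variants", "variant_classification")}

    print(affected_activities(used, was_generated_by, "ClinVar_v1"))
    # {'variant_classification', 'report_step'}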
  • 51. DataScienceWorkshop Islamabad,April2017 P.Missier 51 Minimal sub-graphs in SVI Change in ClinVar Change in GeneMap Partial execution following a change in only one of the databases requires caching the intermediate data at the boundary of the blue and red areas
  • 52. DataScienceWorkshop Islamabad,April2017 P.Missier 52 Results • How much can we save? • Process structure • First usage of reference data • Overhead: storing interim data required in partial re-execution • 156 MB for GeneMap changes and 37 kB for ClinVar changes Time savings:
    Change    Partial re-execution (sec)   Complete re-execution (sec)   Time saving (%)
    GeneMap   325                          455                           28.5
    ClinVar   287                          455                           37
  • 53. DataScienceWorkshop Islamabad,April2017 P.Missier 53 P2: Differential execution Suppose D is a relation (a table). diffD(D1, D2) can be expressed as a pair of difference sets, ΔD+ = D2 \ D1 (new records) and ΔD- = D1 \ D2 (removed records). The idea is to compute P(D2) as the combination of the previously computed P(D1) with P(ΔD+) and P(ΔD-), rather than re-running P on all of D2. This is effective if the difference sets are small relative to D2. If the operators that make up P satisfy certain algebraic properties(*), then this can be achieved as follows: P(D2) = P(D1 ∪ ΔD+ \ ΔD-) = (P(D1) ∪ P(ΔD+)) \ P(ΔD-). (*) Associative, distributive over set union and difference
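A toy illustration of the differential idea, under the slide's assumption that P distributes over set union and difference (here P is a simple selection); the data values are invented:

    # Assemble P(D2) from the cached P(D1) plus P applied only to the (much
    # smaller) difference sets, instead of re-running P on all of D2.
    def P(db):
        """A distributive 'process': select pathogenic records."""
        return {rec for rec in db if rec[1] == "pathogenic"}

    D1 = {("v1", "benign"), ("v2", "pathogenic"), ("v3", "uncertain")}
    D2 = {("v1", "benign"), ("v2", "pathogenic"), ("v4", "pathogenic")}  # new version

    delta_plus, delta_minus = D2 - D1, D1 - D2      # diffD(D1, D2)
    cached = P(D1)                                   # computed earlier

    incremental = (cached | P(delta_plus)) - P(delta_minus)
    assert incremental == P(D2)
    print(incremental)   # {('v2', 'pathogenic'), ('v4', 'pathogenic')}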
  • 54. DataScienceWorkshop Islamabad,April2017 P.Missier 54 P2: Partial re-computation using input difference Insight: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2)) Works for SVI, but hard to generalise: depends on the type of process Gain: diff(CV1, CV2) is much smaller than CV2
    GeneMap versions (from –> to)   ToVersion rec. count   Difference rec. count   Reduction
    16-03-08 –> 16-06-07            15910                  1458                    91%
    16-03-08 –> 16-04-28            15871                  1386                    91%
    16-04-28 –> 16-06-01            15897                  78                      99.5%
    16-06-01 –> 16-06-02            15897                  2                       99.99%
    16-06-02 –> 16-06-07            15910                  33                      99.8%
    ClinVar versions (from –> to)   ToVersion rec. count   Difference rec. count   Reduction
    15-02 –> 16-05                  290815                 38216                   87%
    15-02 –> 16-02                  285042                 35550                   88%
    16-02 –> 16-05                  290815                 3322                    98.9%
  • 55. DataScienceWorkshop Islamabad,April2017 P.Missier 55 P3: Identifying the scope of change: a game of battleship Patient / change impact matrix Challenge: precisely identify the scope of a change Blind reaction to change: recompute the entire matrix Can we do better? - Hit the high impact cases (the X) without re-computing the entire matrix
  • 56. DataScienceWorkshop Islamabad,April2017 P.Missier 56 A scoping algorithm Coarse-grained provenance indicates whether or not a dependency on D existed … but not which specific data from version Dt of D was used Candidate invocation: Any invocation I of P whose provenance contains statements of the form: used(A,Dt), wasPartOf(A,I), wasAssociatedWith(I,_,P) used(I,Dt),wasAssociatedWith(I,_,P) - For each candidate invocation I of P: - partially re-execute using the difference sets as inputs - compute the minimal subgraph P’ of P that needs re-computation - repeat: execute P’ one step at-a-time until <empty output> or <P’ completed> - If <P’ completed> then - Execute P’ on the full inputs Sketch of the algorithm (simplified):
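A simplified Python sketch of the scoping loop described above; the step functions, patient data and callbacks are hypothetical placeholders, not the ReComp implementation:

    # For each candidate invocation (one whose provenance says it used some
    # version of the changed database), run the minimal sub-process P' on the
    # difference sets first; only if every step yields a non-empty output is
    # the full re-execution triggered.
    def scope_and_recompute(candidates, steps, delta, full_inputs, run_full):
        impacted = []
        for inv in candidates:
            data, completed = delta, True
            for step in steps:                    # execute P' one step at a time
                data = step(inv, data)
                if not data:                      # empty output -> not affected
                    completed = False
                    break
            if completed:                         # P' completed on the deltas
                impacted.append(inv)
                run_full(inv, full_inputs[inv])   # re-execute on the full inputs
        return impacted

    # toy usage: one step that keeps only variants mentioned in the patient's record
    patients = {"p1": {"v2"}, "p2": {"v9"}}
    step = lambda inv, variants: variants & patients[inv]
    hits = scope_and_recompute(
        candidates=["p1", "p2"], steps=[step], delta={"v2", "v4"},
        full_inputs={"p1": "full ClinVar", "p2": "full ClinVar"},
        run_full=lambda inv, inputs: print("re-running", inv, "on", inputs))
    print(hits)   # ['p1']  (p2 is out of scope for this ClinVar change)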
  • 57. DataScienceWorkshop Islamabad,April2017 P.Missier 57 Scoping: precision • The approach avoids the majority of re-computations given a ClinVar change • Reduction in number of complete re-executions from 495 down to 71
  • 58. DataScienceWorkshop Islamabad,April2017 P.Missier 58 Scoping: efficiency [Chart: total re-computation time for the whole patient cohort; execution time (hours) vs. ClinVar update date (mm/yy); series: CV blind, CV selective process, CV selective scope δ-gen, CV selective scope δ-SVI] We expect to pay a penalty for running the algorithm when the difference sets are large compared to the actual new data. More accurate diff functions result in higher runtime savings: • smaller difference sets • more precise scoping
  • 59. DataScienceWorkshop Islamabad,April2017 P.Missier 59 Provenance in ReComp: Summary and Challenges Objective: • Reduce the amount of computation performed in reaction to changes 1. Partial re-execution of previously computed workflows 2. (Differential execution) 3. Identifying the scope of change • Makes use of (2) to determine which instances of a population of outcomes are going to be affected by a change Challenges / work in progress: • Validate and extend the approach to other case studies • Design estimators to predict the impact of change • Design and implement a generic ReComp meta-process • Observe P in execution • Detect changes • Selectively react to changes
  • 60. DataScienceWorkshop Islamabad,April2017 P.Missier 60 Talk Outline • Provenance, why? (in science) • Provenance, of what? • Of Scientific Data: The DataONE Federation of Data Repositories (dataone.org) • Of database data (very briefly) • Of Web data  the W3C PROV data model for provenance • Provenance for Data Science • Provenance-enabled data analytics frameworks • Provenance in the ReComp project • Provenance for streaming data analytics
  • 61. DataScienceWorkshop Islamabad,April2017 P.Missier 61 Process-specific provenance using templates Process definition: P: X --> Y Provenance template T Sidecar binding process definition: PB: <xi,yi,P> --> B Execution: xi --> Pi --> yi <xi,yi,P> PB B Apply(B,T) PROV document P Aim: enable provenance generation from black box analytics Approach: 1) human-oriented: "Prospective provenance" --> YesWorkflow 2) machine-oriented: PROV --> XML, RDF, Neo4J data + query models The resulting provenance is • Application-level • User-specified • Coarse-grained
  • 62. DataScienceWorkshop Islamabad,April2017 P.Missier 62 Illustration: the case of map() [y1 … yn] = map(λ x.f(x), [x1 … xn]) Process definition: map: <X, lambda x: f(x)> --> Y Provenance template T :x :y f :a Execution: [y1 … yn] = map(lambda x: f(x), [x1 … xn]) :y. :a. f :x Binding: B = { <:x ← x1, :y ← y1, :a ← map1>, … <:x ← xn, :y ← yn, :a ← mapn> } :x1 :map1 :y1 Type:plan Type:plan f :xn :yn:map1 Apply(B,T) Sidecar process BP used used used gen gen gen assoc assoc
  • 63. DataScienceWorkshop Islamabad,April2017 P.Missier 63 PROV-N rendering of the map() case [y1 … yn] = map(λ x.f(x), [x1 … xn]) Provenance template T entity(f, [prov:type = ‘prov:plan']) entity(:x) entity(:y) activity(:a), wasAssociatedWith(:a,_,f) used(:a,:x), wasGeneratedBy(:y,:a) wasDerivedFrom(:y,:x) B = {<:a ← gen(a, i), :x ← xi, :y ← yi > |i : 1 . . . n} entity(f, [prov:type = ‘prov:plan']) activity(a_1), wasAssociatedWith(a_1,_,f) … activity(a_n), wasAssociatedWith(a_n,_,f) entity(x_1), ..., entity(x_n) entity(y_1), ... ,entity(y_n) used(a_1, x_1), wasGeneratedBy(y_1, a_1) … used(a_n, x_n), wasGeneratedBy(y_n, a_n) wasDerivedFrom(y_1,x_1) … wasDerivedFrom(y_n,x_n) Prov = apply(B,T)
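apply(B, T) on slides 62 and 63 can be sketched as plain template expansion: one copy of the parameterised statements per binding. An illustrative sketch, not the authors' implementation:

    # T is a list of parameterised PROV-N statements, B a list of bindings.
    TEMPLATE = [
        "activity({a})",
        "wasAssociatedWith({a}, -, f)",
        "entity({x})",
        "entity({y})",
        "used({a}, {x})",
        "wasGeneratedBy({y}, {a})",
        "wasDerivedFrom({y}, {x})",
    ]

    def apply_template(bindings, template=TEMPLATE):
        """bindings: iterable of dicts mapping template variables to identifiers."""
        statements = ["entity(f, [prov:type = 'prov:plan'])"]
        for b in bindings:
            statements.extend(stmt.format(**b) for stmt in template)
        return statements

    xs = [3, 1, 2]
    ys = list(map(lambda x: x * x, xs))   # the computation being described: [9, 1, 4]
    B = [{"a": f"a_{i+1}", "x": f"x_{i+1}", "y": f"y_{i+1}"} for i in range(len(xs))]
    print(ys)
    print("\n".join(apply_template(B)))   # expands to the PROV-N rendering above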
  • 64. DataScienceWorkshop Islamabad,April2017 P.Missier 64 Application of template approach to streaming analytics Data in movement is a prime source for value-added analytics applications - e.g. data streams from Internet of Things devices The provenance of an output data stream is a stream of provenance statements The template / sidecar process / binding framework applies without changes: Following the Spark streaming model: A stream is discretised into a sequence of micro-batch intervals Window W = [x1 … xk]: user-defined sequence of intervals xi Process P: Wi → Y operates on a window at a time (Y may be a sequence or other multivariate data structure)
  • 65. DataScienceWorkshop Islamabad,April2017 P.Missier 65 Provenance streams A simple template: “each :y is generated (gen) by an activity :a that used (used) the content of one window :w”. The sidecar binding processes BP1, BP2, … are invoked alongside each invocation of P1, P2, … Each BPi produces a binding Bi, and each Bi is applied to the template, producing a PROV document Provi … this results in a stream of provenance alongside the stream of outputs yi. Execution: wi = [xi1 … xin]; y1 = P1(w1) → B1 = BP1(w1, P1, y1), Prov1 = apply(B1, T); y2 = P2(w2) → B2 = BP2(w2, P2, y2), Prov2 = apply(B2, T); …
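Putting the pieces together for the streaming case: a small Python sketch in which a hypothetical sidecar binding function runs alongside each window invocation and emits one binding, and hence one small provenance record, per window. This mirrors the discretised-stream model informally; it does not use the Spark Streaming API:

    def windows(stream, size):
        for i in range(0, len(stream), size):
            yield stream[i:i + size]

    def P(window):                       # the analytics step: one output per window
        return sum(window) / len(window)

    def sidecar_binding(i, window, output):
        return {"w": f"w_{i}", "a": f"P_{i}", "y": f"y_{i}",
                "window_contents": window, "value": output}

    template = "used(a={a}, w={w}); wasGeneratedBy(y={y}, a={a})"
    stream = [4, 8, 15, 16, 23, 42]

    for i, w in enumerate(windows(stream, size=2), start=1):
        y = P(w)                                              # output stream
        prov_i = template.format(**sidecar_binding(i, w, y))  # provenance stream
        print(y, "|", prov_i)
    # 6.0 | used(a=P_1, w=w_1); wasGeneratedBy(y=y_1, a=P_1)   ... and so on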
  • 66. DataScienceWorkshop Islamabad,April2017 P.Missier 66 Extensions The framework applies to a stateful process P - P’s outcome depends on an internal state S - P’s execution may modify S Hint: S is itself an entity with provenance, defined by its own template and bindings Add flexibility by allowing multiple templates and multiple sidecar processes for the same process execution The actual provenance output at the end of the process: Prov = apply(B,T) is an extensional representation. < T, B> is an intensional representation More space-efficient (T is only stored once, B is a set of variable-value pairs) Only serialised when needed
  • 67. DataScienceWorkshop Islamabad,April2017 P.Missier 67 Summary • Provenance, why? (in science) • Provenance, of Scientific Data: The DataONE Federation of Data Repositories (dataone.org) • Provenance for Data Science • Provenance-enabled data analytics frameworks • Provenance in the ReComp project • (Provenance for streaming data analytics)
  • 69. DataScienceWorkshop Islamabad,April2017 P.Missier 69 Selected bibliography Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell, et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012. http://www.w3.org/TR/prov-dm/ Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012. http://www.w3.org/TR/prov-constraints/ Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web Semantics: Science, Services and Agents on the World Wide Web (April 2015). doi:10.1016/j.websem.2015.04.001. http://www.sciencedirect.com/science/article/pii/S1570826815000177 Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz, Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530. http://dx.doi.org/10.1002/cpe.1870. ProvGen: generating synthetic PROV graphs with predictable structure. Firth, H.; and Missier, P. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer http://arxiv.org/pdf/1406.2495 ProvAbs: model, policy, and tooling for abstracting PROV graphs. Missier, P.; Bryans, J.; Gamble, C.; Curcin, V.; and Danger, R. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer http://arxiv.org/pdf/1406.1998 De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh, Scotland: USENIX Association, 2015. https://www.usenix.org/conference/tapp15/workshop- program/presentation/de-oliveira.

Editor's Notes

  1. W3C Recommendation (REC) A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations.
  2. remark on PROV-AQ: nothing to do with querying, but a query model can be associated with each of the encodings (for the definition of a W3C Recommendation, see note 1). Working Group Note A Working Group Note is published by a chartered Working Group to indicate that work has ended on a particular topic. A Working Group may publish a Working Group Note with or without its prior publication as a Working Draft.
  3. baseline-noAgents.provn
  4. baseline-noAgents-unqual.n3
  5. baseline-noAgents.provn agents are software, organization, person -- non-normative distinguish between normative and non-normative parts of the PROV documents Examples of association between an activity and an agent are: creation of a web page under the guidance of a designer; various forms of participation in a panel discussion, including audience member, panelist, or panel chair; a public event, sponsored by a company, and hosted by a museum;
  6. baseline-noAgents-unqual.n3 (same notes as note 5)
  7. (same notes as note 5)
  8. A browser button by which the user can express their uncertainty about a document being displayed (“so how do I know I can trust this information?”). Upon activation of the button, the software retrieves metadata about the document, listing the assumptions on which trust can be based.
  9. Provenance query over the hdb_store_2 store (SWI-Prolog fragment, reformatted one goal per line; the member/2 goal is truncated as in the original):
    % the invocation WI1 whose import activity CV1Import used document CV1
    hdb_store_2:used(_, CV1Import, CV1, _, _),
    hdb_store_2:wasPartOf(CV1Import, WI1),
    % uncommented so that WIst is bound for parse_time/2 below (assumption)
    hdb_store_2:execution(WI1, WIst, WIet, WIAttrs),
    % WI1 started on or after 8 Dec 2016
    parse_time(WIst, WIStartTS),
    date_time_stamp(date(2016, 12, 8), Today),
    WIStartTS @>= Today,
    % the invocation identifier is the last token of WI1's URI and must exceed 55900
    split_string(WI1, "/", "", WITokens),
    last(WITokens, WInvId),
    number_string(WInvIdNo, WInvId),
    WInvIdNo > 55900,
    % added so that CV1Attrs is bound (assumption, following the GM1/Out1/PV1 pattern)
    hdb_store_2:document(CV1, CV1Attrs),
    % CV1's label ends with "Del.csv"
    %sub_string(CV1Attrs.get('prov:label'), 0, _, _, "variant_summary-"),
    sub_string(CV1Attrs.get('prov:label'), _, _, 0, "Del.csv"),
    % (repeated in the original)
    hdb_store_2:used(_, CV1Import, CV1, _, _),
    hdb_store_2:wasPartOf(CV1Import, WI1),
    % the same invocation also imported document GM1
    hdb_store_2:wasPartOf(GM1Import, WI1),
    hdb_store_2:used(_, GM1Import, GM1, _, _),
    hdb_store_2:document(GM1, GM1Attrs),
    split_string(GM1, "/", "", GM1Tokens),
    append(_, [GM1DocId, _], GM1Tokens),  % append takes the second-to-last element of the list
    member(GM1DocId,                      % (list argument truncated in the original)
    % WI1 exported a document labelled "svi-classification.csv"
    hdb_store_2:wasPartOf(Out1Export, WI1),
    hdb_store_2:wasGeneratedBy(_, Out1, Out1Export, _, _),
    hdb_store_2:document(Out1, Out1Attrs),
    Out1Attrs.get('prov:label') == "svi-classification.csv",
    % a third, distinct import activity brought in document PV1
    hdb_store_2:wasPartOf(PV1Import, WI1),
    PV1Import \= CV1Import,
    PV1Import \= GM1Import,
    hdb_store_2:used(_, PV1Import, PV1, _, _),
    hdb_store_2:document(PV1, PV1Attrs),
  10. To what extent can these be formalised and automated?
  14. Data in movement (the Velocity dimension of Big Data) is gaining prominence as a prime source of data for analytics applications. Data streams generated by Internet of Things devices, for instance, are a rich source of implicit signals about the habits of the individuals who operate those devices, e.g. in their smart homes, in smart cars, or through wearables.
  15. The native Spark compute method is used to plug a LineageRDD instance into the Spark dataflow (described in Section 4).
  16. Firstly, if we can analyse the structure and semantics of process P, we may be able to recompute an instance of P more effectively by reducing re-computation to only those parts of the process that are actually involved in processing the changed data. For this we are inspired by techniques for smart rerun of workflow-based applications [6, 7], as well as by more general approaches to incremental computation [8, 9].
  17. (Note that we need to restrict the query to the specific invocation I, in case other workflows may have used the output of I.)
  18. Also, as in Tab. 2 and 3 in the paper, I’d mention whether this reduction was possible with a generic diff function or with a function tailored specifically to SVI. What is also interesting, and what I would highlight, is that even if the reduction is very close to 100% (but below it), the cost of recomputing the process may still be significant because of constant-time overheads related to running a process (e.g. loading data into memory). e-SC workflows suffer from exactly this issue (every block serializes and deserializes data), which is why Fig. 6 shows an increase in runtime for GeneMap executed with 2 deltas even though the reduction is 99.94% (cf. Tab. 2 and Fig. 6 for the GeneMap diff between 16-10-30 → 16-10-31).
  19. Regarding the algorithm, you show the simplified version (Alg. 1). But please also take a look at Alg. 2 and mention that you can only keep running the loop as long as distributivity holds for all P in the downstream graph. Otherwise, you need to break and re-execute on the full inputs as soon as the first non-distributive task produces a non-empty output. But, obviously, the hope is that with a well-tailored diff function the output will be empty in the majority of cases.
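A minimal sketch of the loop described in this note, in Python; the task model (run on a difference set, run_full on the complete new inputs, an is_distributive flag) is an illustrative assumption and is not the paper's Alg. 2 verbatim:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Task:
        name: str
        run: Callable[[set], set]        # re-execute on a difference set only
        run_full: Callable[[], set]      # re-execute on the complete new inputs
        is_distributive: bool = True

    def recompute_downstream(tasks: List[Task], delta: set) -> set:
        current = delta
        for i, task in enumerate(tasks):
            out = task.run(current)
            if not out:
                # Empty output on the difference set: nothing downstream is affected.
                return set()
            if not task.is_distributive:
                # A non-distributive task produced a non-empty output, so
                # propagating deltas is no longer sound: break out and
                # re-execute from this task onward on the full new inputs.
                for t in tasks[i:]:
                    out = t.run_full()
                return out
            current = out
        return current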
  20. This figure emphasizes the penalty for running the algorithm when the difference sets were large compared to the actual new data. But it also highlights the importance of the diff and impact functions. Clearly, the more accurate these functions are, the higher the runtime savings may be, which stems from two facts. Firstly, a more accurate diff function tends to produce smaller difference sets, which reduces the time of task re-execution (cf. the CV-diff and CV-SVI-diff lines in Fig. 7). Secondly, a more accurate impact function tends to return false more frequently, and so the algorithm can more often avoid re-computation with the complete new version of the data (cf. the number of black squares vs. the total number of patients affected by a change in Tab. 4).
  21. Differential execution: if I can execute P again using only the changes between the old and new input, and the result is empty, then I may conclude that P did not use any of the elements in the difference set.
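A toy sketch of this differential execution check; the set-based diff and the example process P are illustrative, and (as note 19 points out) the conclusion is only safe when re-executing P on a difference set is meaningful, e.g. when P distributes over input differences:

    # Differential execution: run P on the difference between old and new
    # inputs; an empty result suggests P did not use any changed element.

    def diff(old_input, new_input):
        # Inputs are modelled as sets of records; real diff functions may be
        # type-specific (cf. the tailored diff functions discussed above).
        return new_input - old_input

    def unaffected_by_change(P, old_input, new_input):
        delta = diff(old_input, new_input)
        return len(P(delta)) == 0   # empty output => the change does not reach P's output

    # Example: P selects records greater than 10.
    P = lambda records: {r for r in records if r > 10}
    print(unaffected_by_change(P, {1, 2, 3}, {1, 2, 3, 4}))   # True: 4 is filtered out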
  22. Map is well defined / white box and well understood, so it is a good starting point to appreciate the template and binding idea. The first element is a provenance template <show template T1> --> show a graph using provtoolbox. Then we have a Binding, along with a binding generation process <example B1>. Finally, apply(B, T) --> show the final prov graph using provtoolbox.
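A compact sketch of the template-and-binding idea for a map operator, in plain Python rather than ProvToolbox; the MAP_TEMPLATE statements, the ex: identifiers, and traced_map are illustrative assumptions:

    # One binding per input element: each output y_i is derived from x_i
    # by one execution of the map activity.

    MAP_TEMPLATE = [
        "used(var:map, var:x)",
        "wasGeneratedBy(var:y, var:map)",
        "wasDerivedFrom(var:y, var:x)",
    ]

    def apply_template(binding, template=MAP_TEMPLATE):
        doc = []
        for stmt in template:
            for var, value in binding.items():
                stmt = stmt.replace(var, value)
            doc.append(stmt)
        return doc

    def traced_map(f, xs):
        ys, prov = [], []
        for i, x in enumerate(xs):
            ys.append(f(x))
            binding = {"var:x": f"ex:x{i}", "var:y": f"ex:y{i}", "var:map": "ex:map"}
            prov.extend(apply_template(binding))   # apply(B, T) per element
        return ys, prov

    ys, prov = traced_map(lambda v: v * 2, [1, 2, 3])
    print(ys)
    print(prov)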
  23. Spark Streaming uses a micro-batch architecture, whereby a streaming computation is achieved through a sequence of batch computations, each operating on a fragment of the stream defined by a configurable batch interval. Each micro-batch consists of a finite sequence of data structures (RDDs, Spark's main data abstraction), known as a DStream. We may view a DStream within each batch as a list of values, like those we have used in our batch examples. Thus, just as it allows Spark Streaming to reuse most of its RDD transformation and action operators, micro-batching also makes our framework reusable for generating provenance over streams. Spark also supports windows as application-friendly abstractions over streams. A window is simply a sequence of contiguous micro-batches, and a breakdown of the stream into windows is specified in the usual way, namely by a combination of window size (the number of micro-batches) and sliding duration. In this setting, the template T is defined so that it applies to each window, and the sidecar process executes once for each window, producing a stream of bindings B1, B2, …. Correspondingly, this triggers a sequence of calls apply(B1, T), apply(B2, T), …, resulting in a stream of provenance statements alongside the corresponding data input and output streams.
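A sketch of this per-window arrangement using the classic Spark Streaming (DStream) API; the socket source, the batch and window durations, the choice of P (a sum over the window), and the string-substitution template are illustrative assumptions:

    # Per-window provenance with Spark Streaming: foreachRDD fires once per
    # window; the handler computes y = P(w), runs the sidecar binding step,
    # and emits apply(B, T) alongside the output.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    TEMPLATE = ["used(var:a, var:w)", "wasGeneratedBy(var:y, var:a)"]

    def apply_template(binding, template=TEMPLATE):
        doc = []
        for stmt in template:
            for var, value in binding.items():
                stmt = stmt.replace(var, value)
            doc.append(stmt)
        return doc

    window_counter = {"i": 0}   # foreachRDD handlers run on the driver

    def process_window(time, rdd):
        values = rdd.collect()
        if not values:
            return
        y = sum(values)                     # the process P: one window at a time
        window_counter["i"] += 1
        i = window_counter["i"]
        binding = {"var:w": f"ex:w{i}", "var:a": f"ex:P{i}", "var:y": f"ex:y{i}"}
        print(y, apply_template(binding))   # output stream + provenance stream

    if __name__ == "__main__":
        sc = SparkContext(appName="prov-stream-sketch")
        ssc = StreamingContext(sc, batchDuration=5)                 # 5s micro-batches
        nums = ssc.socketTextStream("localhost", 9999).map(int)
        nums.window(windowDuration=30, slideDuration=30).foreachRDD(process_window)
        ssc.start()
        ssc.awaitTermination()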