Multidisciplinary engineer and entrepreneur David Wood discusses the reasons, approaches and success stories for structured data on the World Wide Web. Linked Data is placed in context with the rest of the Web and that context is used to suggest some areas ripe for entrepreneurial innovation.
4. David Wood
[Table: company, founded, products, disposition. Recoverable entries: Plugged In Software; founded 2002 and 2005; products RDF Database, RDF Database Management, RDF Usage and Linked Data Management; disposition "ongoing" (twice).]
8. 40% annual growth in data produced
5% annual growth in IT spending
Digital Information Produced: 1.8 ZB (2012), 35 ZB (2020)
Daily (2013): 294B emails, 230M tweets, 4.8T online ad impressions
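A quick arithmetic sketch relating the two endpoint volumes to the quoted growth rate. It treats growth as steady annual compounding, which is an assumption; the slide does not say how its figures were derived.

```python
# Endpoint figures taken from the slide; steady compounding is an assumption.
start_zb, end_zb = 1.8, 35.0          # zettabytes produced in 2012 and in 2020
years = 2020 - 2012

# Annual rate r such that start_zb * (1 + r) ** years == end_zb
implied_rate = (end_zb / start_zb) ** (1 / years) - 1

# Volume reached over the same period at a flat 40% per year
projected_zb = start_zb * 1.40 ** years

print(f"implied annual growth: {implied_rate:.1%}")
print(f"1.8 ZB growing 40% a year for {years} years: {projected_zb:.1f} ZB")
```

Under this assumption the implied rate comes out somewhat above 40%, so the headline rate and the two endpoints are of the same order but do not match exactly.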
14. “The Web is the minimal concession to
hypertext that a sequence-and-hierarchy
chauvinist could possibly make.”
“HTML is precisely what we were trying to
PREVENT-- ever-breaking links, links
going outward only, quotes you can't
follow to their origins, no version
management, no rights management.”
“The "Browser" is an extremely silly
concept-- a window for looking sequentially
at a large parallel structure. It does not
show this structure in a useful way.”
29. New Data Requirements
• Global access
• Open format
• Record context
  • to allow sharing
  • to allow reuse
• Record provenance
30. Challenges
• Global access: Need to publish to the Web
• Open format: Most data currently bound
to proprietary tools/formats
• Context: Data often structured for
individual use without thought to sharing
• Provenance: Paradoxically easy given
solutions to the others
31. Linked Data on the Web
[Figure: graph of "my data". A measurement (date: 2011-01-01; value: 0; units of measure: degrees Centigrade; ...) was collected at Galway Airport and collected by a collector, who is a Person (first name: Michael; last name: Hausenblas).]
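The graph on this slide can be written down as subject-predicate-object statements. The following is a minimal sketch in plain Python rather than an RDF library; the `http://example.org/` namespace and all property names are hypothetical, since the slide gives labels but no URIs.

```python
# Hypothetical namespace: the slide shows labels only, not real URIs.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

measurement = EX + "measurement/1"
collector = EX + "person/michael-hausenblas"
airport = EX + "place/galway-airport"

# Each statement is a (subject, predicate, object) triple.
triples = {
    (measurement, RDF_TYPE, EX + "Measurement"),
    (measurement, EX + "date", "2011-01-01"),
    (measurement, EX + "value", 0),
    (measurement, EX + "unitsOfMeasure", "degrees Centigrade"),
    (measurement, EX + "collectedAt", airport),
    (measurement, EX + "collectedBy", collector),
    (collector, RDF_TYPE, EX + "Person"),
    (collector, EX + "firstName", "Michael"),
    (collector, EX + "lastName", "Hausenblas"),
}

def objects(subject, predicate):
    """Return every object linked from subject by predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Follow a link from the measurement to the person who collected it.
collector_uri = objects(measurement, EX + "collectedBy")[0]
first_name = objects(collector_uri, EX + "firstName")[0]
print(first_name)
```

Because the collector and the airport are identified by URIs rather than plain strings, other datasets could refer to the same resources, which is the point of linking.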
39. Active PURLs for Clinical Study Aggregation
David Wood (david@3roundstones.com) and Tom Plasterer (Tom.Plasterer@astrazeneca.com)
[Figure: Active PURL pipeline. (1) User resolves a single URI to an Active PURL. Multiple targets are queried independently; the targets are HTTP-accessible endpoints capable of returning XML or textual content. XML or textual results are converted to RDF. RDF is rendered to HTML via a template.]
The problem: No coordinated view of clinical study information. Information is distributed across departments, subsidiaries and government data sources.
The solution: Gather, convert, aggregate and format for display
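The gather, convert, aggregate and format steps can be sketched as follows. This is an illustrative outline only, not the Callimachus implementation: the study URL, the two stub targets, their XML fields and all function names are hypothetical, and the stubs stand in for real HTTP requests.

```python
import xml.etree.ElementTree as ET
from html import escape

STUDY = "http://example.org/study/123"   # hypothetical PURL for one study

# Stand-ins for HTTP endpoints; a real system would fetch over the network.
def registry_target(_purl):
    return "<study><title>Example trial</title><phase>2</phase></study>"

def internal_target(_purl):
    return "<study><sponsor>Example sponsor</sponsor></study>"

def xml_to_triples(subject, xml_text):
    """Convert: emit one (subject, predicate, object) triple per child element."""
    root = ET.fromstring(xml_text)
    return {(subject, child.tag, (child.text or "").strip()) for child in root}

def resolve_active_purl(purl, targets):
    """Gather each target's result and aggregate into one temporary graph."""
    graph = set()
    for target in targets:
        graph |= xml_to_triples(purl, target(purl))
    return graph

def render_html(purl, graph):
    """Format: render the merged graph as a simple HTML table."""
    rows = "".join(
        f"<tr><th>{escape(p)}</th><td>{escape(o)}</td></tr>"
        for _, p, o in sorted(graph)
    )
    return f"<h1>{escape(purl)}</h1><table>{rows}</table>"

graph = resolve_active_purl(STUDY, [registry_target, internal_target])
page = render_html(STUDY, graph)
print(page)
```

The merged graph exists only for the duration of one resolution, so each request reflects whatever the targets return at that moment.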
3 Round Stones and AstraZeneca created a system to allow coordinated views of distributed clinical trial information. The system extended the Callimachus Project, an Open Source management system for Linked Data.
Persistent URLs, or PURLs, were used to provide globally unique and resolvable identifiers for each clinical study. The PURL concept was extended so that a PURL can have multiple targets and the results of each target can undergo arbitrary transformation. PURLs with these capabilities are called Active PURLs.
Information sources relevant to clinical studies were identified, regardless of whether they were internal or external to the pharmaceutical company's network. Active PURLs were used to resolve data sources having HTTP endpoints capable of returning XML or textual results. The results from each information source are dynamically transformed into Resource Description Framework (RDF) formats, and all sources' results are then merged into a single, temporary graph of RDF data.
Using the Callimachus template engine, the information is rendered to end users as coordinated HTML descriptions of each clinical trial. Machine-readable versions of the data are also available.
How semantic technologies help
Linked Data techniques can help both to make clinical trial information available and to build effective information systems that use it. Linked Data techniques allow for "cooperation without coordination". Publishers of data provide context for use by third parties in other portions of a distributed enterprise. Users of Linked Data can combine information from multiple sources. Subsequent publication can create a virtuous circle of positive feedback, allowing researchers, informaticists and support staff to build a reusable knowledge base collaboratively and in a distributed fashion.
Challenges
Distributed queries have many known limitations. Every target queried is a potential point of failure in any given PURL resolution: HTTP timeouts, authentication or authorization errors, and other network failures can slow a pipeline or stop it from returning correctly.
Similarly, distributed queries can show variable query-time performance, because both network and endpoint performance vary.
Next steps
Proactive caching and cache management strategies can improve runtime performance and protect end users from the limitations inherent in a distributed query architecture. Caching of intermediate results from endpoints has not yet been implemented.
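One way to picture caching of intermediate endpoint results is a small time-to-live cache that falls back to the last good result when a target fails. This is a sketch under stated assumptions, not the system's design: the poster says caching is not yet implemented, the class and parameter names are hypothetical, and this is a reactive cache rather than the proactive caching the poster proposes.

```python
import time

class TargetCache:
    """Caches each target's last good result and reuses it on failure."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}            # target name -> (stored_at, result)

    def get(self, name, fetch):
        now = self.clock()
        entry = self._entries.get(name)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]           # fresh hit: no network call needed
        try:
            result = fetch()
        except (TimeoutError, ConnectionError):
            # Serve the stale copy if there is one; otherwise return None
            # so the caller can skip this target instead of failing outright.
            return entry[1] if entry is not None else None
        self._entries[name] = (now, result)
        return result

# Usage with a fake clock and an endpoint that fails after its first call.
calls = []

def flaky_fetch():
    calls.append(1)
    if len(calls) > 1:
        raise TimeoutError("endpoint did not answer")
    return "<study/>"

fake_now = [0.0]
cache = TargetCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get("registry", flaky_fetch)    # fetched and stored
fake_now[0] = 120.0                           # entry is now past its TTL
second = cache.get("registry", flaky_fetch)   # fetch fails, stale copy served
```

Serving stale data trades freshness for availability; whether that is acceptable for clinical study information is a policy decision this sketch does not address.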
User experience
1. Users resolve a URL that provides a unique identifier for a clinical study, drug, chemical or other concept managed by this system. Users may be presented with the URL on HTML pages, find it via full-text search or discover it via semantic search.
2. Users are presented with a dynamically generated Web page representing aggregated clinical study information. Users are isolated from the complex and distributed information environment.
41. Your Opportunity?
• Linked Data warehouses: 10B USD annually
• Linked Data supply chains: 205M USD annually (Web), 6B USD annually (enterprise)
• Linked Data analytics: 16B USD annually