Presented as a tutorial at the 2023 Knowledge Graph Conference, this deck explores different ways that information can be transformed across knowledge portals, from basic RDF structures to the use of SPARQL UPDATE-based workflows. It then explores how ChatGPT can be used to expand upon this transformation capability, and why knowledge portals should be considered transformation engines for graphs.
2. Who is Kurt Cagle?
Editor, The Cagle Report (https://thecaglereport.com)
Email: kurt.cagle@thecaglereport.com
LinkedIn: https://linkedin.com/in/kurtcagle
Calendly: https://calendly.com/semantical. Open office hours.
I like graphs, large language models, metadata, AI, future of work
I consider myself a data therapist. Book a free hour if you want to talk.
3. Purpose of This Class
The purpose of this class is to teach you transformation techniques for the
knowledge portal
I’m assuming you have a basic working knowledge of RDF and SPARQL, and have
at least heard of SHACL.
When done, you should have new tools for making your data sit up and beg, and,
hopefully, a new way of thinking about the RDF stack.
4. Warning!!!
It is likely that somewhere in this class, I will
skewer a sacred cow or five.
There are many different ways of building
ontologies, many different best practices.
Frankly, many of those best practices are no
longer relevant, or represent ways of thinking
that should be changed in the face of advances
in the technology.
Hopefully, even if you disagree with the author,
there will be useful information contained
herein, but go in with an open mind and a
willingness to at least question your beliefs, if
not necessarily change them.
5. Transformations
A knowledge portal is not a query system, but a transformational one.

It transforms information from one ontology to another. It transforms data from one format to another. It creates new knowledge from old knowledge.

Unless you understand knowledge graph transformations, you are not getting everything out of your knowledge portal.
7. Namespaces
Namespaces underlie a great deal of semantics, but they tend to be poorly
utilized, especially with regard to ontological modeling. The following techniques
may be useful (and are used throughout this deck).
Use namespaces that correspond to classes. For instance, a Character class might
have an associated namespace:
Namespace: http://comicsdata.org/ns/Character#
PREFIX Character: <http://comicsdata.org/ns/Character#>
The class can then be specified with the prefix in Turtle:
Character:Catwoman a Character: .
Most triple stores released since around 2017 support this notation.
8. Namespace Construction
The use of class-based namespaces can simplify constructing and
deconstructing URIs in SPARQL:
str(Character:) -> "http://comicsdata.org/ns/Character#"
strafter(str(Character:Catwoman), str(Character:)) -> "Catwoman"
iri(concat(str(Character:), "Catwoman")) ->
<http://comicsdata.org/ns/Character#Catwoman> ->
Character:Catwoman

In some cases, you do not need to explicitly convert URIs to strings:
iri(concat(Character:, "Catwoman"))
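As a sketch, these functions can be combined in a query to move an identifier from one class namespace to another (the Villain: namespace is invented for illustration; Character: is the deck's running example):

```sparql
PREFIX Character: <http://comicsdata.org/ns/Character#>
PREFIX Villain:   <http://comicsdata.org/ns/Villain#>

SELECT ?villainUri WHERE {
  ?s a Character: .
  # Extract the local name from the Character: namespace ...
  BIND(STRAFTER(STR(?s), STR(Character:)) AS ?localName)
  # ... then reattach it to the Villain: namespace.
  BIND(IRI(CONCAT(STR(Villain:), ?localName)) AS ?villainUri)
}
```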
9. Namespace Best Practices
Store namespace and prefix strings in SHACL NodeShape
declarations.
This makes it easier to construct contexts in SPARQL and JSON
Filepaths can be stored as namespaces.
E.g., PREFIX basePath: <file:///path/to/root/>
basePath:foo/bar.ttl -> <file:///path/to/root/foo/bar.ttl>
RDF-XML handles this notation just fine.
This notation is friendlier to JSON-LD.
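The earlier tip about storing namespace and prefix strings in SHACL NodeShape declarations can be sketched with SHACL's standard sh:declare / sh:PrefixDeclaration vocabulary (the shape name is illustrative):

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix Character: <http://comicsdata.org/ns/Character#> .

Character:Shape a sh:NodeShape ;
    sh:targetClass Character: ;
    # The prefix/namespace pair lives with the shape, where tooling
    # can retrieve it to build SPARQL prologues or JSON-LD contexts.
    sh:declare [
        a sh:PrefixDeclaration ;
        sh:prefix "Character" ;
        sh:namespace "http://comicsdata.org/ns/Character#"^^xsd:anyURI ;
    ] .
```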
10. Namespace Anti-Patterns
Stop trying to map URIs to URLs. It
makes URIs brittle.
If you must, designate an ontology URI
that can be mapped.
Instance URIs are not that important,
property and class URIs are.
Do not use ontology import predicates –
use SPARQL Update LOAD instead. (Indeed,
do you really need ontologies?)
The number of class namespaces has no
appreciable impact on performance.
Use named graphs to avoid namespace
collision.
If you use Camel Case, stick with it. If
you use underscores, stick with it.
11. Turtle Tricks
Turtle is not just compact – it's a powerful language for organizing information, but only if it's used.
12. Blank Nodes Are Structure Pointers
Blank nodes can be confusing until you know that a blank node is a pointer to a structure.

Turtle notation deliberately hides blank nodes, but you can use them to create arrays, dictionaries, parameter lists, and similar structures.

At the same time, using SHACL provides a way of deprecating the use of blank nodes where they really should be named nodes.
13. Square Brackets = Dictionaries
Turtle uses the square bracket to indicate a dictionary.

Character:Catwoman a Character: ;
    Character:address [
        a Address: ;  # The class is usually implied
        Address:street "1313 Mockingbird Lane" ;
        Address:city "Arkham" ;
        Address:type AddressType:MailingAddress ;
    ] .

This is equivalent to

Character:Catwoman a Character: ;
    Character:address _:CatwomanAddress .

_:CatwomanAddress  # This is the implicit blank node.
    a Address: ;
    Address:street "1313 Mockingbird Lane" ;
    Address:city "Arkham" ;
    Address:type AddressType:MailingAddress .
Note that containership is implied, but illusory. These are still triples.
15. Parentheses = Linked Lists
Turtle uses parentheses to indicate a linked list.

Character:HarleyQuinn Character:memberOfOrg (
    Org:SinCitySirens Org:SuicideSquad
) .

This is equivalent to

Character:HarleyQuinn Character:memberOfOrg _:OrgList .
_:OrgList rdf:first Org:SinCitySirens ;
    rdf:rest _:secondItem .
_:secondItem rdf:first Org:SuicideSquad ;
    rdf:rest rdf:nil .
Linked lists are intrinsically ordered, regardless of whether
inferencing is enabled or not.
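Because the list is just rdf:first/rdf:rest triples, a property path will walk it; a sketch, reusing the deck's namespaces (most stores return members in list order, though SPARQL itself does not guarantee row order):

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX Character: <http://comicsdata.org/ns/Character#>

SELECT ?org WHERE {
  Character:HarleyQuinn Character:memberOfOrg ?list .
  # rdf:rest*/rdf:first hops down the linked list to each member.
  ?list rdf:rest*/rdf:first ?org .
}
```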
16. Annotations – Double Angle Brackets
An annotation is metadata that applies to a single assertion, using the RDF-Star mechanism:

<<Character:Batman Character:description "World's greatest detective!">>
    Assertion:comment "Wait! What about Sherlock Holmes, or Hercule Poirot?!" .

The triple in the double angle brackets is the subject of the annotation, and
again can be thought of as a blank node to a data structure with subject,
predicate, and object values:

_:assertion rdf:subject Character:Batman ;
    rdf:predicate Character:description ;
    rdf:object "World's greatest detective!" ;
    Assertion:comment "Wait! What about Sherlock Holmes, or Hercule Poirot?!" .

The <<>> notation is a new addition to RDF, RDF-Star, that is currently
undergoing discussion as a standard, though it has been adopted by most
modern triple stores.
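Stores that support RDF-Star usually also support SPARQL-Star, where the quoted triple appears directly in a query pattern; a sketch using the example above (the Assertion: namespace URI is an assumption):

```sparql
PREFIX Character: <http://comicsdata.org/ns/Character#>
PREFIX Assertion: <http://comicsdata.org/ns/Assertion#>

SELECT ?comment WHERE {
  # The quoted triple itself is the subject of the annotation triple.
  <<Character:Batman Character:description "World's greatest detective!">>
      Assertion:comment ?comment .
}
```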
18. Annotations – Best Practices
Typically an annotation of an
assertion will consist of a dictionary
node that holds several properties,
rather than just one comment.
An assertion has a unique identifier,
even if two such assertions have the
same subject, predicate, and object.
This means that multiple annotations
can be made about the same assertion
by different authorities.
The Open Annotation Standard is a good
framework for annotating content, and
works especially well with RDF-Star.
Annotations can be useful to indicate
version changes of individual
assertions, as well as a way to track
provenance.
19. Pointer Containers
Occasionally, you'll see pointer structures, such as

[] a Character: .

or

[ a Character: ] rdfs:label "Joe" .

An empty blank node is treated the same as an anonymous URI or pointer.
In the second case, this should be read as "there exists a Character, whose label is Joe".
Two separate blank nodes are assumed to have different URIs.

SPARQL notation emerged from the use of "named" blank nodes that were in fact
treated as variable names for nodes. Turtle structural notation consequently translates
directly into SPARQL structural notation.
20. Literals and Datatypes
One of the ways we underutilize knowledge graphs is in not doing enough with
datatypes.
Most people use the standard xsd: datatypes without ever thinking about why
they shouldn't.

If you have a length measure, rather than putting the type in properties, use
"25"^^qudt:Meters.

If you have a full or partial population measure, use "8.01E9"^^quantity:People.

If you have a Markdown document, use "# Cool Title\n## by Kurt Cagle"^^textFormat:Markdown.

Use your datatypes to indicate how literals are parsed, then add metadata. Nuff
said.
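Queries can then dispatch on the custom datatype; a sketch (the textFormat: namespace URI is the slide's illustrative one):

```sparql
PREFIX textFormat: <http://comicsdata.org/ns/textFormat#>

SELECT ?doc ?body WHERE {
  ?doc ?p ?body .
  # datatype() exposes the custom datatype, so a renderer can route
  # these literals through a Markdown pipeline and others to plain text.
  FILTER(datatype(?body) = textFormat:Markdown)
}
```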
22. The Problem with CONSTRUCT
The CONSTRUCT command in SPARQL is often used to produce graphs, but
because of this utility, it also hides much more useful capabilities. It is an
anachronism from OWL inference rules, in which the addition of new triples
would cause those rules to create virtual triples.
Increasingly, organizations are abstracting access to knowledge portals to
GraphQL or JSON-LD, and as such SPARQL is hidden behind layers of security. One
benefit of this is that one underutilized feature of SPARQL Update – named
graphs – is beginning to come into its own, especially because it enables
workflows.
This is what we’ll cover here.
23. The True Structure of “Triples”
Subject             Predicate  Object      Graph    AssertionID
Character:Catwoman  rdf:type   Character:  Default  uri:urn:12051AFCD…

The first three columns are the "true" triple; the Graph column is the container for the triple; the AssertionID column is the identifier for the triple, used by RDF-Star.

The modern "triple" is actually (minimally) a pentuple. The assertion ID field is used for reification, and identifies the pentuple as a unique object. The graph field indicates that this particular tuple is part of a specific set. If the triple is the same but the graph is different, the assertion ID will be different too.
24. Named Graphs
If a pentuple's graph field is set to a URI, that URI becomes the "name" of the graph that it
belongs to.

All other pentuples with the same name are in the same graph.

If a pentuple has the same triple values as another pentuple but has a different
graph name, then they are in different graphs.

This means that the same triple can be contained in multiple graphs.

The graph name is a URI, just like a node or assertion identifier.
25. Default Graph
When a triple is inserted into the graph without specifying its graph, the triple will
be placed in the default graph.
The default graph can be inclusive or exclusive. Where it is inclusive, a query
against a triple without specifying a graph will retrieve all triples from all graphs.
Where it is exclusive, the only triple that will be retrieved will be the one already in
the default graph.
Check with your vendor whether your system is inclusive or exclusive. Most such
systems can go from the one mode to the other with a simple software switch in
the product.
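One way to sidestep the inclusive/exclusive question entirely is to name the graph position explicitly in the query; a sketch:

```sparql
# Retrieve triples from every named graph, regardless of whether the
# store's default graph is configured as inclusive or exclusive.
SELECT ?g ?s ?p ?o WHERE {
  GRAPH ?g { ?s ?p ?o }
}
```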
26. Named Graph Use Cases
Landing graph of newly ingested data
Graph containing ingestion graph data
converted to local ontology
Graph holding all instances of a given class
Graph containing reports generated from
analysis
Graph holding draft vs. approved resources
(workflow)
Graph containing SHACL constraints
Graph holding transformed content for
output
Graphs containing data catalogs
Graphs holding intermediate calculations
Graphs holding frequently requested query
results
Graphs containing documentation
Graphs containing document stores
Graphs containing controlled vocabularies
for rapid lookup
Unions, intersections, diffs
The list goes on and on …
27. Named Graphs vs. Data Store Partitions
Many data portals have distinct data stores that
are partitioned with certain configurations.
Such stores are usually best for multi-tenant
operations, as these typically also have
authentication and security considerations.
Named graphs are conceptually a level lower –
they exist within a single security perimeter
and are optimized for rapid clearing.
Modern named graphs usually provide
secondary indexes, so adding or removing
a triple is as simple as adding or removing a
URI from an array.
28. Internal Arrangement of Named Graphs
Subject             Predicate  Object      Graph
Character:Catwoman  rdf:type   Character:  Graph:Graph1

Graph         AssertionID
Graph:Graph1  uri:urn:12051AFCD…
Graph:Graph2  uri:urn:4792AE109…
Default       uri:urn:319AD1592…

The pentuple arrangement at a deeper level illustrates how graphs can be moved,
copied, and deleted so quickly, as the graph key is itself part of an index. Garbage
collection only occurs when the last graph is removed from the tuple.
29. INSERTING DATA THROUGH SCRIPTS
The following SPARQL UPDATE script will add explicit triples to a graph:
# Namespaces Declared Here
INSERT DATA {
  GRAPH Graph:Catwoman {
    ex:Catwoman rdf:type ex:Antihero ;
      rdfs:label "Catwoman" ;
      ex:alterEgo "Selina Kyle" ;
      ex:description "A skilled thief and occasional ally of Batman, who uses her athletic abilities, martial arts skills, and cunning to navigate the criminal underworld of Gotham City." ;
      ex:superPowers "Peak human strength, agility, and endurance; expert martial artist and hand-to-hand combatant; skilled thief and acrobat" ;
      ex:gender "Female" ;
      ex:universe "DCEU" .
  }
};
30. Digging Deeper Into Insert
The INSERT DATA command uses (I believe) the same syntax as the TriG standard,
save that the namespaces use PREFIX instead of @prefix in the context header.

Multiple graphs can be populated this way in the same statement.

This is often useful for spot or test data, or for configuration data.

Unlike SPARQL queries, SPARQL UPDATE commands are transactional – multiple SPARQL
UPDATE statements can be run in the same script if separated by semicolons.
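A sketch of a two-statement script (graph names are illustrative); because the script is one transaction, if either statement fails, both roll back:

```sparql
PREFIX ex: <http://example.org/ns#>
PREFIX Graph: <http://example.org/graphs#>

INSERT DATA {
  GRAPH Graph:Staging { ex:Catwoman a ex:Antihero . }
};
# A second statement in the same transaction, separated by a semicolon.
ADD Graph:Staging TO Graph:Production
```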
31. Using the DELETE / INSERT model
The powerhouse of SPARQL UPDATE is the DELETE/INSERT/WHERE command
which can be thought of as the supercharged version of CONSTRUCT.
The WHERE clause determines the graph (and the variables) that the DEL/INS will
be working on.
The DELETE statement is a CONSTRUCT-like template: the triples it produces from
the WHERE bindings are removed from the main graph.

The INSERT statement is a CONSTRUCT-like template: the triples it produces are
added to the main graph, if they do not already exist.
Together these three keywords can be used to transform one graph into another.
32. Delete/Insert/Where
This identifies the working group in the WHERE clause, then DELETES and INSERTS
the triples with the corresponding variables.
# Namespaces Declared Here
DELETE {
  GRAPH ?gOld {
    ?sOld ?pOld ?oOld .
  }
}
INSERT {
  GRAPH ?gNew {
    ?sNew ?pNew ?oNew .
  }
}
WHERE {
  # Use SPARQL to determine old and new variables
}
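A filled-in sketch of the same pattern, renaming an ingestion-schema property into an internal one (all graph and property names are illustrative):

```sparql
PREFIX ingest: <http://example.org/ingest#>
PREFIX Character: <http://comicsdata.org/ns/Character#>
PREFIX Graph: <http://example.org/graphs#>

# Move hero names from the landing graph's schema into the internal one.
DELETE {
  GRAPH Graph:Landing { ?s ingest:heroName ?name . }
}
INSERT {
  GRAPH Graph:Characters { ?s a Character: ; Character:name ?name . }
}
WHERE {
  GRAPH Graph:Landing { ?s ingest:heroName ?name . }
}
```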
33. GRAPH Commands
Command  Example                          Comments
CREATE   CREATE GRAPH graph:Foo           Creates an empty graph with the given name
DROP     DROP GRAPH graph:Foo             Drops (deletes) the graph from the system
CLEAR    CLEAR GRAPH graph:Foo            Clears the data from the graph but retains the graph itself
COPY     COPY graph:Foo TO graph:Bar      Replaces the triples in graph:Bar with those in graph:Foo, but leaves graph:Foo untouched
MOVE     MOVE graph:Foo TO graph:Bar      Replaces the triples in graph:Bar with those in graph:Foo, then eliminates graph:Foo
ADD      ADD graph:Foo TO graph:Bar       Copies the triples in graph:Foo into graph:Bar, without removing the old graph:Bar content
LOAD     LOAD <uri> INTO GRAPH graph:Foo  Loads external RDF from a file system or the internet. <uri> must be a hard reference.
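These commands chain naturally into a staging script; a sketch (graph names and the source URL are illustrative):

```sparql
PREFIX Graph: <http://example.org/graphs#>

# Stage a fresh ingest: clear the landing area, load the file,
# snapshot the current production graph, then promote the new data.
CLEAR GRAPH Graph:Landing;
LOAD <http://example.org/data/superheroes.ttl> INTO GRAPH Graph:Landing;
COPY Graph:Production TO Graph:Backup;
MOVE Graph:Landing TO Graph:Production
```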
34. Workflows with SPARQL UPDATE
SPARQL UPDATE is transactional, and can have multiple operations per script.
If the transaction fails at any point, the results are rolled back to the previous state.
Named graphs make it possible to create and populate a graph, then use that
graph to generate one or more additional graphs, which can then trigger other
actions.
Conditions within graphs also mean that a DELETE/INSERT statement can be
short-circuited (or activated) only if the right graph conditions exist in the WHERE
clause, making for conditional logic.
These are WORKFLOWS.
35. Superhero Example Workflow
1. Load an external Superheroes.ttl file into an ingestion graph.
2. Use DEL/INS to convert this file to an internal schema in superheroes graph.
3. From this converted schema, use DEL/INS to generate a SHACL file based upon the
superheroes graph, putting that into a SHACL graph.
4. Finally, use DEL/INS to create a message in a message queue graph indicating that an
update has been made to the superheroes and SHACL graphs.
And on to the demo!!!
(I will be posting the demo breakdown to a separate article called Workflows in
SPARQL at https://thecaglereport.com.)
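Step 4 of the workflow above can be sketched as a small insert into a message-queue graph (the graph names and msg: vocabulary are invented for illustration):

```sparql
PREFIX Graph: <http://example.org/graphs#>
PREFIX msg: <http://example.org/ns/message#>

INSERT {
  GRAPH Graph:MessageQueue {
    # Each message is a blank-node dictionary, per the Turtle tricks above.
    [] a msg:Message ;
       msg:about Graph:Superheroes, Graph:SHACL ;
       msg:event msg:Updated ;
       msg:timestamp ?now .
  }
}
WHERE { BIND(NOW() AS ?now) }
```

A downstream service can then poll (or subscribe to) the message-queue graph to trigger the next action.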
36. Workflow Thoughts
Transactions lock graphs in use. This means that you can create temporary graphs
in your script, so long as they get dropped upon completion.
Temporary graphs can also be used to save and alter working triples. This is a way
of storing variables between transactions. You cannot set global variables directly
in transactions otherwise.
You cannot run a SELECT or CONSTRUCT statement at the transaction level.
However, you can run them from within a WHERE clause.
LOAD, sadly, does not support a WHERE clause. To load from external resources,
you may need to use SERVICE invocations instead, which can be run from the
WHERE clause.
37. Passing Variables Between Transactions
# Create temporary graph with variable content
INSERT {
  GRAPH Graph:Temp {
    Temp:date1 Temp:hasValue ?now .
  }
}
WHERE { BIND(NOW() AS ?now) };

# In a later transaction, retrieve the variable value.
INSERT {
  Transaction:123 Transaction:hasDate ?date .  # Transaction:hasDate is illustrative
}
WHERE {
  GRAPH Graph:Temp { Temp:date1 Temp:hasValue ?date }
}
38. Ingest Thoughts
There are a number of ways to get non-RDF data into a knowledge portal.
Most commercial portals have connectors to JSON, XML, relational databases, YAML, message
queues, openTelepathy (COMING SOON!) …
It is STILL worth the time to map these to an internal organizational ontology.
Internal transformations can create maps to relevant controlled vocabularies and
taxonomies.
To get a good start, use AutoGPT AI or similar to do the bulk of the mapping for you.
This is where having a way of identifying different ontologies comes in handy, and
will usually get you 80% of the way there.
That remaining 20% is often critical for your business, and deserves to have human
eyeballs on it.
39. Last Thoughts on Named Graphs
Wrap your instances, by associated class, in named graphs for that class, and
stuff that graph name into your SHACL metadata for that class.

Querying the class graph directly is much cheaper than searching by class name,
finding the associated graph, and then retrieving the results.

If you're really ambitious, wrap each instance in a named graph tied to the
subject IRI, then use ?s (rdf:*)+ ?o to get the full transitive closure for ?s, to put
into that graph. This will get you a super DESCRIBE that will often get you info you
normally have to write a lot of ugly code to get, and it's FAST.
Most knowledge portals have named graph endpoints. Go wild.
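Standard SPARQL 1.1 has no wildcard predicate path, but the same closure can be approximated with a negated property set; a sketch, where (!<urn:x-none:nothing>)* is a common idiom for "any predicate, zero or more hops" (the dummy IRI is arbitrary):

```sparql
PREFIX Character: <http://comicsdata.org/ns/Character#>

# Copy the transitive closure of one subject into its own named graph.
INSERT {
  GRAPH Character:Catwoman { ?s ?p ?o . }
}
WHERE {
  # Bind ?s to Catwoman and every node reachable from her ...
  Character:Catwoman (!<urn:x-none:nothing>)* ?s .
  # ... then collect each reachable node's outgoing triples.
  ?s ?p ?o .
}
```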
41. Turtle and JSON
Not all systems support it, but a few extension functions can prove immensely
valuable.
The function toJSON(listNode|graphNode) as string will convert either the root node
of a list or a named graph node into a serialized JSON string that can then be
persisted in a literal of type rdf:JSON. This can be used in SPARQL and SPARQL
Update.

The function fromJSON(jsonStr, graphNode) will convert that string back into triples in
the given graph node, and would be available in SPARQL Update.

This ability really comes in handy with SELECT statements serialized to JSON,
which then contain the serialized literals as sub-JSON fragments.
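Since these are extension functions rather than standard SPARQL, signatures and namespaces will vary by vendor; where available, usage might look like this entirely hypothetical sketch:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://example.org/fn#>
PREFIX Graph: <http://example.org/graphs#>
PREFIX Character: <http://comicsdata.org/ns/Character#>

# Hypothetical: snapshot a named graph as an rdf:JSON literal.
INSERT {
  Character:Catwoman Character:snapshot ?json .
}
WHERE {
  BIND(STRDT(ex:toJSON(Graph:Catwoman), rdf:JSON) AS ?json)
}
```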
42. Presentation as Function
View presentations as modules (likely running in NodeJS) that can be selected to shape output.

Presentation modules would likely be written in NodeJS and would be able to access the knowledge graph via SPARQL calls.

Presentation modules could handle different variants of JSON, XML, Markdown, CSV, and so forth, as well as perform outbound transformations to pidgin ontologies.

Similar inbound modules could handle natural language queries in a manner similar to ChatGPT, as well as simplify GraphQL deployment.
43. SHACL for Schema Metadata
Regardless of whether you validate content or not, think about using SHACL
within your applications for schema metadata
SHACL works well with RDFS, and can help to document your schemas
SHACL is a good place to store metadata equivalencies
SHACL can hold presentation metadata that can simplify UX dramatically.
SHACL is often used in conjunction with GraphQL
SHACL can support function definition and metadata.
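A sketch of a NodeShape carrying documentation and presentation metadata alongside its constraints (the ui: presentation vocabulary is invented for illustration; sh:name is standard SHACL):

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ui:   <http://example.org/ns/ui#> .
@prefix Character: <http://comicsdata.org/ns/Character#> .

Character:Shape a sh:NodeShape ;
    sh:targetClass Character: ;
    rdfs:comment "A character appearing in a comics universe." ;
    sh:property [
        sh:path Character:name ;
        sh:datatype xsd:string ;
        sh:name "Name" ;          # standard SHACL display label
        ui:widget ui:TextField ;  # invented presentation hint
        ui:sortOrder 1 ;
    ] .
```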
44. You Can Do Worse Than Jena
Big data is sexy. We want our databases to be huge and comprehensive, even if
99.9995% of that data is never, ever touched. It’s why we get so excited about
large language models in AI, even though they’re too complex to keep up to date.
Perhaps it's time to think small again. Jena's an open-source knowledge portal
with a barely-there UI. But slap a NodeJS front end running Express on it,
create named services that handle workflows along with a pretty UX, and you
can usually get what you need up and running within days rather than months.
Think about turning them into Solid Pods while you’re at it – a good idea that just
needs the right platform.
Think not about ingestion, but Expression!
45. Don’t Sweat Ontologies
An ontology is a glorified term for an organization’s language. Your organization is
likely to be different than mine, so its language will be different. There’s nothing wrong
with that.
Think in terms of pidgins (no, not the birds). A pidgin is a trade language, simplified so
that people speaking it can get most of the ideas across, even if it involves a lot of
hand-waving.
As you build out your language, add equivalent terms (or transformations) to your
classes and properties to map to those pidgins you use. It need not be perfect – we’re
getting pretty good at translation.
When you need that final 20%, get on the phone and talk with your customers, your
vendors, your agents. Knowledge graphs are really good for storing pidgins.
Don’t sweat the small stuff.
46. Big Trends
GraphQL is becoming the mechanism to talk to
knowledge graphs. Make your GraphQL RDF compliant,
and you’re golden. Use SPARQL for the heavy graphy
stuff that shouldn’t be public anyway.
SHACL is showing up as the way to universally define
schemas. Use SHACL to drive your GraphQL interfaces.
Graph doesn’t always have to be Turtle, but JSON that
can represent RDF is a win across the board.
Markdown is deconstructing HTML. It’s driving code
repositories and is the language that LLMs are using.
The age of the intricate web app may be ending as
making data meaningful overrides making web pages
overinteractive.
The buzzword for today is Generative. Knowledge
Portals are Generative Engines. Think about it.
47. Why I Like XSLT 3
XML is dead.
However, JSON by itself is difficult to traverse, because dictionaries and arrays are
two very different things. Recursion is hard on JSON.
However, if you canonicalize JSON (a relatively easy and fast process) as tokens
that can be represented as XML, then you can use the same kind of deep recursive
processing that XML people were used to doing.
Language is recursive.
XSLT3 is a recursive pattern matching transformation engine that works with most
data formats, including JSON and RDF. It can denormalize relational data into
trees and vice versa. It’s a pretty decent non-LLM based text interpreter as well.
48. The Dinosaur in the Living Room
Are you tired of AI yet?
What we’re discovering about large
language models is that the solution to AI is
not to suck up Wikipedia and Github.
Instead, it is to create smaller, manageable,
composite models that can be merged
together when needed to build up
contextual engines.
LORAs, which started out in the Diffusion
space, are now giving way to Chinchillas
that are beginning to look more and more
like … knowledge graphs.
In simpler terms, you don’t need one super-
duper genius, but a few relatively smart
people working together.
49. Feeding Your LLM (and SLMs)
AI is great at classifying, but poor at naming. It is surprisingly good at
summarizing, something people generally are not great at. It is getting better at
reasoning, but that is an expensive capability.
Knowledge graphs can benefit from LLM capabilities, but more to the point,
knowledge graphs can also in turn provide provenance, evolution and higher
order reasoning to both large and small language models.
While there are a number of different approaches, JSONL is becoming the
preferred mechanism for fine tuning such models. RDF is superfood to such
models, rich with connections.
50. Summary
Knowledge Portals should be
transformation engines.
Knowledge graphs can represent complex
structures in a more universal manner than
any other data representation (including AI).
Knowledge graphs' primary weaknesses
stem from becoming too fixated on rigor
and protocol, even as other technologies
evolve around them.
By beefing up the RDF stack to better allow
for map/reduce transformations especially,
we stand a better chance of remaining not
only relevant but vital.
Generative AI (Machine Learning) and
Symbolic AI (Semantics) must work
together, as they represent collectively the
breadth of knowledge programming.