When people think about designing ontologies, they frequently leave one of the most powerful tools available to a modeler – the ability to create named graphs and annotations – on the cutting room floor. Part of the reason for this is that, in order to do real justice to named graphs, it is necessary to bring out the heavy guns of SPARQL Update. However, the benefits to be had by being able to manipulate graphs usually make it well worth the effort.
An Introduction to Named Graphs
Named graphs were formalized as part of the RDF 1.1 revisions in 2014. A named graph is, in simplest terms, a set of triples with an associated graph identifier, turning these triples into quads. That is to say, if I have a graph described by the following triples,
person:JaneDoe a person: .
person:JaneDoe rdfs:label "Jane Doe" .
person:JaneDoe person:hasPet pet:Felicia .
pet:Felicia pet:type petType:cat .
pet:Rover pet:type petType:dog .
I can group together the first four statements into a named graph.
GRAPH graph:JaneDoeData {
person:JaneDoe a person: .
person:JaneDoe rdfs:label "Jane Doe" .
person:JaneDoe person:hasPet pet:Felicia .
pet:Felicia pet:type petType:cat .
}
pet:Rover pet:type petType:dog .
Internally, the graph name is represented by a fourth slot beyond the normal :subject, :predicate, and :object slots used so heavily in Turtle and SPARQL, specifically intended to identify the graph itself. If a triple is not explicitly within a named graph, it is considered to be in the default graph. Thus, the assertion pet:Rover pet:type petType:dog . is considered to be in the default graph, since it appears outside of any braces associated with a graph name.
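To make that fourth slot concrete, here is roughly how these statements would look in the N-Quads serialization (the full IRIs here are assumptions, since the prefix bindings aren't spelled out above):
<http://example.com/ns/person#JaneDoe> <http://example.com/ns/person#hasPet> <http://example.com/ns/pet#Felicia> <http://example.com/ns/graph#JaneDoeData> .
<http://example.com/ns/pet#Rover> <http://example.com/ns/pet#type> <http://example.com/ns/petType#dog> .
The first line carries a fourth term naming its graph; the second has none, and so lives in the default graph.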
Visually, this can be shown as a Venn-style diagram, with the JaneDoeData graph nested inside the default graph.
There are several things to note here:
- First, the default graph is all-encompassing – everything within the JaneDoeData graph is also in the default graph. This also suggests that one graph can be contained within another graph, such that if graph B is in graph A, then any item in graph B will also be part of graph A.
- Similarly, you can create Venn diagrams of graphs.
- Not all items in a given graph need to share the same subject (nor necessarily even be connected). For instance, in the above graph, person:JaneDoe and pet:Felicia are both subjects.
- On the other hand, if a named graph is queried, only those items in the named graph will be part of the solution.
For instance, the following SPARQL script will only pick up the statement pet:Felicia pet:type petType:cat .
select ?pet ?petType where {
GRAPH graph:JaneDoeData {?pet pet:type ?petType}
}
while working with the default graph (no GRAPH clause at all) will retrieve both that and the equivalent statement for Rover:
select ?pet ?petType where {
?pet pet:type ?petType
}
since graph:JaneDoeData is WITHIN the default graph. Indeed, if the same triple is contained in two different graphs, the SPARQL engine will see these as distinct assertions because the named graphs are different. This can lead to apparent duplication, especially if the user making the query is unaware that graphs are in use. This is one reason why the decision to use graphs should be made early in the ontology process: graphs add a layer of organization (and corresponding complexity) to any model.
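As a quick check, a query like the following (a sketch, assuming the Felicia triple has been asserted in more than one named graph) returns one row per graph containing the triple, making any duplication visible:
SELECT ?g WHERE {
GRAPH ?g {pet:Felicia pet:type petType:cat}
}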
Update: Different knowledge graph databases behave differently when dealing with the default graph. In Ontotext's GraphDB, for instance, the default graph behaves as in the example given above (it is the union of all graphs), while with both Stardog and Jena, the default is treated as a distinct graph that doesn't include the named graphs. This behavior can be changed depending on your system. In the case where the default is not encompassing, the named graphs sit alongside, rather than inside, the default graph.
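As an illustration of working around an exclusive default, Jena exposes the union of all named graphs through the pseudo-graph IRI urn:x-arq:UnionGraph, so a Jena-specific query such as this sketch recovers the inclusive behavior:
SELECT ?s ?p ?o
FROM <urn:x-arq:UnionGraph>
WHERE {
?s ?p ?o
}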
Named Graphs and SPARQL Query/Update
Named graphs start to come into their own when combined with SPARQL Update. For instance, consider one of the more annoying aspects of SPARQL – the rather anemic result sets that come with the SPARQL DESCRIBE statement. Ordinarily, what gets returned for a given IRI are its direct literal values, but for object properties, usually just the IRIs themselves. For web applications, it would be nice to also retrieve the labels, types, and type labels for those objects; otherwise, you have to parse out the objects and resolve them with multiple additional operations.
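As a point of comparison, here is a minimal sketch of the kind of result we are after, written as a CONSTRUCT query that pulls in labels and types one hop out (it uses the same $rootNode parameter convention as the scripts below; the OPTIONAL clauses keep objects without labels from being dropped):
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
$rootNode ?predicate ?object.
?object rdfs:label ?objectLabel.
?object a ?objectType.
?objectType rdfs:label ?typeLabel.
}
WHERE {
$rootNode ?predicate ?object.
OPTIONAL {?object rdfs:label ?objectLabel}
OPTIONAL {?object a ?objectType.
OPTIONAL {?objectType rdfs:label ?typeLabel}}
}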
One solution is to pre-populate a graph for each subject node containing this additional data. There are several advantages to this approach, not least of which is that as a retrieval method, it’s fast. You’re reading a list of triples in a graph rather than performing multiple joins in extracting those triples. The primary drawback is that should any subordinate property change, then the graph will not reflect those changes until the next time that graph is generated. (In other words, it acts like any other validating cache).
The heart of this routine involves a trick I hadn't come across until recently: a special predicate called rdf:*. This is a universal (wildcard) predicate supported by most knowledge graph systems (though it isn't a part of the RDF standard). It also has absolutely nothing to do with the RDF-star specification.
# getGraphEnvelope($rootNode as node())
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subject ?predicate ?object WHERE {
bind($rootNode as ?rootNode)
?rootNode (rdf:*/^rdf:*)+ ?subject.
?subject ?predicate ?object.
}
Ordinarily, DESCRIBE returns the immediate children of the requested root node, though if an object is a blank node (typical of a data structure), the triples for that blank node also get resolved. The script above, on the other hand, performs the transitive closure over all outbound nodes from the given node. As such, it could get quite deep, though as a general rule of thumb, outbound paths tend to be considerably less numerous (and typically shorter) than inbound paths.
# createEnvelope($rootNode)
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX graph: <http://www.example.com/ns/graph#>
# Drop the old envelope (if any), then rebuild it from the current data.
DELETE {GRAPH ?rootNode {?s1 ?p1 ?o1}}
INSERT {
GRAPH ?rootNode {
?subject ?predicate ?object.
?rootNode rdf:about [
graph:hasLastModifiedDate ?now;
graph:hasLastModifiedBy $systemIdentifier;
rdfs:label ?message
].
}
}
WHERE {
bind($rootNode as ?rootNode)
# OPTIONAL lets the update succeed even when no prior envelope graph exists.
OPTIONAL {graph ?rootNode {?s1 ?p1 ?o1}}
?rootNode (rdf:*)+ ?subject.
?subject ?predicate ?object.
bind(now() as ?now)
bind(concat("Graph ", str($rootNode), " ", str(?now)) as ?message)
}
This routine uses the URI of the resource in question as the graph identifier. It eliminates the old graph if such exists, generates an envelope of all outbound triples connected to the resource, then creates a message that attaches to the graph indicating when and where it was created. $rootNode is then passed as the parameter when the SPARQL update is invoked.
Once an envelope has been created, you can retrieve it as a simple graph call:
fn:graph($rootNode)
One interesting consequence of this approach is that this call will also automatically retrieve the ancestry path for the class if this is defined in the broader graph.
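Note that fn:graph() is not part of the SPARQL standard, so on stores without such a convenience function, the same retrieval can be written as a plain scan of the named graph – a single pattern rather than a set of joins:
SELECT ?subject ?predicate ?object WHERE {
GRAPH $rootNode {?subject ?predicate ?object}
}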
FROM and FROM NAMED
Note that a graph IRI is, like most things in RDF, simply a computer-friendly name for the graph resource, which is, in turn, a set of triples. Regardless of whether the default graph is inclusive or exclusive, you can use the FROM keyword to set the default to be a named graph temporarily:
For instance, suppose you had three graphs: graph:cats, graph:dogs, and graph:ferrets. If you wanted the triples available from the first two graphs, but not the third, to act as the default, you'd use the FROM keyword:
SELECT *
FROM graph:cats
FROM graph:dogs
WHERE {
?s ?p ?o
}
In this particular case, the normal default graph is replaced with the union of graph:cats and graph:dogs. There's no need at this point to use a particular graph context – from the standpoint of the query, it is working on a single default graph made up of these two subgraphs. This holds only for the duration of the query, by the way.
In some cases, it may make more sense to keep one group distinct from another. This is where FROM NAMED comes in: it adds the specified graphs to the query's dataset as named graphs, so that a GRAPH clause can match against them and keep track of which graph each triple came from. For instance, if you wanted to see, given a union of several sets, which graph a given triple came from, you could use the query:
SELECT DISTINCT ?g ?s
FROM NAMED graph:cats
FROM NAMED graph:dogs
FROM NAMED graph:ferrets
WHERE {
GRAPH ?g {?s ?p ?o}
}
FROM NAMED, not surprisingly, is used considerably less than the default FROM keyword.
Overlapping Graphs
Is it possible to have a triple be in two graphs simultaneously? Certainly. For instance, it makes sense to separate category data – controlled vocabularies, reference data, taxonomies, and so forth – from more transactional data. The former may be modeled in SKOS and have a very definite hierarchical structure, while the latter may be considerably flatter and more linear. At the same time, it may make sense to keep each term in that category data in its own graph.
In SPARQL Update, this can be managed by using the INSERT DATA command:
insert data {
graph graph:vocabulary {
color:maroon a color: ;
rdfs:label "Maroon"@en ;
.
}
graph color: {
color:maroon a color: ;
rdfs:label "Maroon"@en ;
color:hasRGBValue "#400000"^^color:RGB ;
color:hasFamily color:maroon ;
.
}
graph color:maroon {
color:maroon a color: ;
rdfs:label "Maroon"@en ;
color:hasRGBValue "#400000"^^color:RGB ;
color:hasFamily color:maroon ;
.
}
}
In this particular scenario, the declaration is kept in three graphs – graph:vocabulary, color:, and color:maroon. A query scoped to the color: graph can then retrieve every color:
SELECT ?label ?family ?value
FROM color:
WHERE {
?color a color: .
?color rdfs:label ?label .
?color color:hasFamily ?family .
?color color:hasRGBValue ?value.
} order by ?value
There is redundancy in the data, but here, there's no particular reason to specify the maroon graph (we might not even know it yet). The color: graph ensures that we are limiting our search exclusively to the set of colors.
Ultimately, this is the value of a named graph – it replaces one or more joins with a list of triples, which can make the difference between a slow query and a far faster one.
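For example, once a term's identifier is known, its entire record can be pulled with a single scan of its own graph (a sketch using the graphs defined above):
SELECT ?p ?o
FROM color:maroon
WHERE {
?s ?p ?o
}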
Loading and Manipulating Graphs
SPARQL Update, as shown above, provides a powerful way of placing triples into graphs both dynamically and statically. However, for those whose workflows are oriented more towards ingestion than computation, there are alternatives. TriG, essentially Turtle with graphs, is a standard similar to Turtle that allows for the specification of graphs at ingest time. More information about TriG can be found at https://www.w3.org/TR/trig/. More information about SPARQL Update can be found at https://www.w3.org/TR/sparql11-update/.
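For instance, the Jane Doe data from earlier could be expressed in TriG roughly as follows (prefix declarations, omitted here, are assumed to match the earlier examples):
graph:JaneDoeData {
person:JaneDoe a person: ;
rdfs:label "Jane Doe" ;
person:hasPet pet:Felicia .
pet:Felicia pet:type petType:cat .
}
# Triples outside any graph block land in the default graph.
pet:Rover pet:type petType:dog .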
Beyond these, SPARQL Update has a number of operations that should be familiar to anyone working with SQL. These are outlined in Table 1.
| Command | Example | Description |
|---|---|---|
| LOAD | LOAD <myData.ttl> INTO GRAPH graph:holding | Loads data from the given URL into the specified graph (or the default graph if the INTO clause is not given). |
| CLEAR | CLEAR GRAPH graph:holding | Empties the triples from the graph but keeps the graph defined. |
| DROP | DROP GRAPH graph:holding | Same as CLEAR, but also removes the graph definition from the system. |
| MOVE | MOVE graph:holding TO graph:processed | Clears the second graph, moves the content of the first graph into it, then clears the first. |
| COPY | COPY graph:holding TO graph:processed | Clears the second graph, then copies the content of the first graph into it, without clearing the first. |
| ADD | ADD graph:holding TO graph:processed | Appends the content of the first graph to the second, without clearing either graph. |
| CREATE | CREATE GRAPH graph:holding | Creates a new, empty graph. |
These operations are transactional in nature, and, as with INSERT and DELETE, multiple operations can be chained in a single request by placing a semicolon between them. Whether such a chained request executes as a single transaction is not strictly mandated by the SPARQL Update specification, but it seems to be the way that most knowledge graph systems implement complex updates.
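A chained request might look like the following sketch (the file URL and graph names are hypothetical):
CREATE SILENT GRAPH graph:holding ;
LOAD <file:///data/source.ttl> INTO GRAPH graph:holding ;
ADD graph:holding TO graph:processed ;
DROP GRAPH graph:holding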
Named Graphs and Workflows
Not surprisingly, once you start thinking about SPARQL Update in transactional terms, the idea of creating workflows seems a logical next step.
For instance, suppose that you wanted to load in a file (likely converted from a spreadsheet), identify values in the data that correspond to terms in a controlled vocabulary, and convert those values into references to the vocabulary. As an example, we'd like to go from
tableAddress:Row5 a tableRow: ;
rdfs:label "125 Springfield Way SEATTLE WA USA 98002"^^xsd:string;
row:hasTable tableAddress: ;
property:StreetAddress "125 Springfield Way"^^xsd:string;
property:City "Seattle"^^xsd:string;
property:State "Washington"^^xsd:string;
property:Country "USA"^^xsd:string;
property:ZipCode "98002"^^xsd:string;
.
to
address:AE2193FFBC112928
a address: ;
rdfs:label "125 Springfield Way SEATTLE WA USA 98002"^^xsd:string;
address:hasCity city:SeattleWA;
address:hasRegionState regionState:WashingtonUSA;
address:hasCountry country:USA;
address:hasPostalCode "98002"^^identifier:USZipCode;
address:hasSource tableAddress:Row5;
.
We'll further assume that only terms within the vocabulary section will be used for lookup, and (because this is already a long article) that the names of cities, states, and so forth in the data source correctly match the names in the vocabulary. One final assumption – each target class has a SHACL definition that looks something like this:
address: a sh:NodeShape;
sh:targetClass address: ;
sh:property address:hasCity, address:hasRegionState, address:hasCountry;
shext:hasAnalog tableAddress: ;
.
address:hasRegionState a sh:PropertyShape;
sh:path address:hasRegionState;
sh:class regionState: ;
sh:nodeKind sh:IRI ;
shext:hasAnalog property:State;
.
address:hasCity a sh:PropertyShape;
sh:path address:hasCity;
sh:class city: ;
sh:nodeKind sh:IRI ;
shext:hasAnalog property:City;
.
address:hasCountry a sh:PropertyShape;
sh:path address:hasCountry;
sh:class country: ;
shext:hasAnalog property:Country;
sh:nodeKind sh:IRI ;
sh:defaultValue country:USA;
.
address:hasStreetAddress a sh:PropertyShape;
sh:path address:hasStreetAddress;
sh:datatype xsd:string ;
shext:hasAnalog property:StreetAddress;
sh:nodeKind sh:Literal ;
.
address:hasPostalCode a sh:PropertyShape;
sh:path address:hasPostalCode;
sh:datatype xsd:string ;
shext:hasAnalog property:ZipCode;
sh:nodeKind sh:Literal ;
.
In this case, shext: is an (arbitrary) extension namespace to SHACL, identifying all of the source properties that have a direct translation to the referenced property. A more comprehensive translation would likely wrap the source property with a transformation, but that's beyond the scope of this article. The SHACL shapes are, of course, kept in a SHACL graph.
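To see what this mapping layer provides, a quick sketch of a query over the SHACL graph (assuming the shacl: graph name used in the workflow below) pulls out the property-to-property analog pairs:
SELECT ?targetProperty ?sourceProperty WHERE {
GRAPH shacl: {
?targetProperty a sh:PropertyShape ;
shext:hasAnalog ?sourceProperty .
}
}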
Given this, we can now make a workflow (I’m deliberately not including namespaces here, but they should be included).
CLEAR GRAPH graph:staging ;
LOAD <file:///home/me/path/to/source.ttl> INTO GRAPH graph:staging ;
CREATE SILENT GRAPH graph:assembly ;
CLEAR GRAPH graph:assembly;
INSERT {
GRAPH graph:assembly {
?instance a ?targetClass;
rdfs:label ?label;
?targetProperty ?object;
.
}
}
WHERE {
GRAPH graph:staging {
?row a tableRow: .
?row rdfs:label ?label.
?row row:hasTable ?sourceTable.
?row ?sourceProperty ?value.
GRAPH shacl: {
?targetClass shext:hasAnalog ?sourceTable.
?targetClass sh:property ?targetProperty.
?targetProperty shext:hasAnalog ?sourceProperty.
}
}
# OPTIONAL: literal values with no matching vocabulary term pass through unchanged.
OPTIONAL {
GRAPH vocabulary: {
?category a ?vocabularyClass.
{{?category skos:prefLabel ?value}
UNION
{?category skos:notation ?value}}
}
}
bind(if(bound(?category),?category,?value) as ?object)
# Hash the source row IRI so all of a row's properties land on one new instance
# (a per-solution UUID function would mint a different IRI for every matched property).
bind(IRI(concat(str(?targetClass),MD5(str(?row)))) as ?instance)
};
CLEAR GRAPH graph:staging ;
MOVE GRAPH graph:assembly TO GRAPH graph:final
Whoosh! This is, admittedly, a very simple conversion (a full version would take into account a number of factors not covered here), but even this shows the power of SPARQL Update as a workflow engine.
Conclusion
Named graphs can have a major effect on your knowledge graph's organization and performance. They make workflow operations possible, and because graph membership is determined by the fourth position in the quad, comparatively little movement of data (and hence reindexing) is needed to take advantage of these capabilities.
Kurt Cagle is the Editor in Chief of The Cagle Report, a former community editor for Data Science Central, and the principal for Semantical LLC, as well as a regular contributing writer for LinkedIn. He has written twenty-four books and hundreds of articles on programming and data interchange standards. He maintains a Calendly free consultation site at https://calendly.com/semantical – if you have a question, want to suggest a story, or just want to chat, set up a free consultation appointment with him there.