I came to RDF through a different path than most people. Back in the 2000s, I’d been fairly heavily involved in XML land, and was wrestling with the first incarnation of XQuery, XSLT 2, and the revised XPath specifications while working on a couple of books. At the time, I was using the eXist XML Database as the basis for one of the books and noticed after a while that one of the primary problems that XML faced (and that JSON faces today) is that it did not handle references between objects well. It was one thing that led me into RDF.
This article touches on many-to-many relationships, and also introduces SHACL, a subject I’ve touched on before but haven’t really explored in depth. SHACL is making its way into a number of different triple store implementations, and to me represents the next logical step in the evolution of the RDF stack. I hope to carry this through to other articles in this series.
The Thorns of Many to Many Relationships
One of the key factors in the development of the relational database was Ted Codd’s work on normalization back in the 1970s. Codd was critical in identifying the normal forms that, loosely speaking, govern how one-to-one (1st normal form), one-to-many (2nd normal form), and many-to-many (3rd normal form) relationships are modeled. Even with document constructs and similar stores, such normal forms are still critical for dealing with complex data structures, especially once time becomes a critical component in a data system.
For instance, consider the case of a Person record. That person may, over time, have several different addresses.
flowchart LR
    p1([Jane Doe])
    h1([House 1 in Boston])
    h2([House 2 in New York])
    h3([House 3 in Seattle])
    p1 -- lived in --> h1
    p1 -- lived in --> h2
    p1 -- lived in --> h3
In the world of XML, this kind of relationship might be rendered as follows:
<Collection
    xmlns="http://www.w3.org/some/fake/rdf/namespace#"
    xmlns:Class="http://www.example.com/Class#"
    xmlns:Person="http://www.example.com/Person#"
    xmlns:House="http://www.example.com/House#"
>
    <Class:Person label="Jane Doe" id="Person:JaneDoe">
        <Person:livedIn>
            <Class:House id="House:House1" label="House 1 in Boston">
                <House:street>1313 Mockingbird lane</House:street>
                <House:city>
                    <Class:City idref="City:Boston" label="Boston"/>
                </House:city>
            </Class:House>
            <Class:House id="House:House2" label="House 2 in New York">
                <House:street>123 Sesame Street</House:street>
                <House:city>
                    <Class:City idref="City:NewYork" label="New York"/>
                </House:city>
            </Class:House>
            <Class:House id="House:House3" label="House 3 in Seattle">
                <House:street>121314 Rain Drive</House:street>
                <House:city>
                    <Class:City idref="City:Seattle" label="Seattle"/>
                </House:city>
            </Class:House>
        </Person:livedIn>
    </Class:Person>
    <Class:City label="Boston" id="City:Boston"/>
    <Class:City label="New York" id="City:NewYork"/>
    <Class:City label="Seattle" id="City:Seattle"/>
</Collection>
For those who lived, ate and breathed XML, this code listing looks perfectly reasonable, if a little heavy on prefixed namespaces. There are a few points to note:
- Namespace prefixes are defined at the root node of the document
- ids and idrefs use CURIEs (compact URIs) for identifiers and references to those identifiers respectively
- Classes have a one-to-one relationship with namespaces.
- Because of this, at any given point, you can see what properties belong to which objects.
- IdRefs (links to resources) incorporate labels, which are advisory only (used primarily for user interfaces).
- Properties always separate a subject from an object if that object is a curie.
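To make the id/idref convention concrete, here is a minimal Python sketch (standard library only) that resolves each idref in a trimmed-down version of the listing to the element carrying the matching id. The function name resolve_idrefs and the dangling City:Atlantis reference are hypothetical, added only to show what happens when a reference has no target.

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the listing above. House2 and City:Atlantis are
# hypothetical additions that illustrate a reference with no matching id.
DOC = """<Collection
    xmlns:Class="http://www.example.com/Class#"
    xmlns:House="http://www.example.com/House#">
  <Class:House id="House:House1" label="House 1 in Boston">
    <House:city><Class:City idref="City:Boston" label="Boston"/></House:city>
  </Class:House>
  <Class:House id="House:House2" label="House 2">
    <House:city><Class:City idref="City:Atlantis" label="Atlantis"/></House:city>
  </Class:House>
  <Class:City id="City:Boston" label="Boston"/>
</Collection>"""


def resolve_idrefs(xml_text):
    """Map every idref value to the element whose id matches it (or None)."""
    root = ET.fromstring(xml_text)
    # Index every element that declares an id.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    # Look each idref up in that index; a dangling reference maps to None.
    return {
        el.get("idref"): by_id.get(el.get("idref"))
        for el in root.iter()
        if el.get("idref") is not None
    }


if __name__ == "__main__":
    for ref, target in resolve_idrefs(DOC).items():
        print(ref, "->", None if target is None else target.get("label"))
```

The point of the sketch is that XML itself does nothing with these references. The lookup has to be written by hand, which is part of what the id and idref conventions above are working around.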
RDF-XML sort-of got this right, but it also made some design decisions that, in hindsight, were poor. In a rush to keep lookup tables small, RDF-XML made the mistake of requiring that any reference to a URI was a fully qualified name. It also made the (confusing) assumption that hypertext links were node links in the graph (primarily because of the belief that one should use RDF primarily for annotating HTML pages). Finally, RDF tied the rdf: and rdfs: schema properties into the definition of RDF content.
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:Class="http://www.example.com/Class#"
    xmlns:Person="http://www.example.com/Person#"
    xmlns:House="http://www.example.com/House#"
    xmlns:City="http://www.example.com/City#"
>
    <!-- Every resource reference is a fully qualified URI; CURIEs are not allowed. -->
    <Class:Person rdf:about="http://www.example.com/Person#JaneDoe" rdfs:label="Jane Doe">
        <Person:livedIn>
            <Class:House rdf:about="http://www.example.com/House#House1" rdfs:label="House 1 in Boston">
                <House:street>1313 Mockingbird lane</House:street>
                <House:city rdf:resource="http://www.example.com/City#Boston"/>
            </Class:House>
        </Person:livedIn>
        <Person:livedIn>
            <Class:House rdf:about="http://www.example.com/House#House2" rdfs:label="House 2 in New York">
                <House:street>123 Sesame Street</House:street>
                <House:city rdf:resource="http://www.example.com/City#NewYork"/>
            </Class:House>
        </Person:livedIn>
        <Person:livedIn>
            <Class:House rdf:about="http://www.example.com/House#House3" rdfs:label="House 3 in Seattle">
                <House:street>121314 Rain Drive</House:street>
                <House:city rdf:resource="http://www.example.com/City#Seattle"/>
            </Class:House>
        </Person:livedIn>
    </Class:Person>
    <Class:City rdf:about="http://www.example.com/City#Boston" rdfs:label="Boston"/>
    <Class:City rdf:about="http://www.example.com/City#NewYork" rdfs:label="New York"/>
    <Class:City rdf:about="http://www.example.com/City#Seattle" rdfs:label="Seattle"/>
</rdf:RDF>
If this were a biographical application, having one record for each address might make sense. However, what happens if the application is for land use records, where there may be several people within the system who occupied the same house over the history of that house?
flowchart LR
    p1([Jane Doe])
    p2([John Smith])
    p3([Eleanor Rigby])
    h3([House 3 in Seattle])
    p1 -- lived in --> h3
    p2 -- lived in --> h3
    p3 -- lived in --> h3
At this stage, the modeling that worked in one case (a person may have multiple domiciles, each of which could be part of the same record) breaks down: each domicile may now also have multiple occupants, which turns this into what amounts to a many-to-many relationship.
The hierarchical nature of XML breaks down somewhat when everything is considered an object in its own right, which is how a database is (somewhat) structured.
RDF solves this problem – you can create nodes in a graph that connect to either literal properties or nodes representing other objects. Modeling here can be accomplished by using what’s called the third normal form, which involves creating an intermediate object type (essentially a table) with cross-references, something we can call a domicile:
flowchart LR
    p1([Jane Doe])
    p2([John Smith])
    p3([Eleanor Rigby])
    h2([House 2 in New York])
    h1([House 1 in Boston])
    h3([House 3 in Seattle])
    d1([Domicile from 2003 to 2008])
    d1 -- has person --> p1
    d1 -- has house --> h1
    d2([Domicile from 2008 to 2014])
    d2 -- has person --> p1
    d2 -- has house --> h2
    d3([Domicile from 2014 to 2023])
    d3 -- has person --> p1
    d3 -- has house --> h3
    d4([Domicile from 2011 to 2014])
    d4 -- has person --> p2
    d4 -- has house --> h3
    d5([Domicile from 2002 to 2011])
    d5 -- has person --> p3
    d5 -- has house --> h3
The primary downside to this form of modeling is that the number of intermediate objects (domiciles) can grow on the order of P × H (where P is the number of people and H is the number of houses). Additionally, it makes querying more difficult, regardless of the kind of query language involved.
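The intermediate-object pattern can be sketched outside of RDF as well. The following is a minimal Python illustration, not a recommended data layer: each domicile is a record that points at one person and one house, and both directions of the many-to-many relationship are answered by scanning those records. The class and function names are hypothetical; the people, houses and dates follow the example in this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Domicile:
    """Intermediate record linking exactly one person to exactly one house."""
    person: str
    house: str
    start: str            # "YYYY-MM"
    end: Optional[str]    # None means there is no recorded end date


DOMICILES = [
    Domicile("JaneDoe", "House1", "2003-01", "2008-10"),
    Domicile("JaneDoe", "House2", "2008-11", "2014-06"),
    Domicile("JaneDoe", "House3", "2014-07", None),
    Domicile("JohnSmith", "House3", "2011-04", "2014-06"),
    Domicile("EleanorRigby", "House3", "2002-02", "2011-03"),
]


def houses_of(person: str) -> List[str]:
    """Houses a person has lived in, ordered by move-in date."""
    ordered = sorted(DOMICILES, key=lambda d: d.start)
    return [d.house for d in ordered if d.person == person]


def residents_of(house: str) -> List[str]:
    """People who have lived in a house, ordered by move-in date."""
    ordered = sorted(DOMICILES, key=lambda d: d.start)
    return [d.person for d in ordered if d.house == house]


if __name__ == "__main__":
    print(houses_of("JaneDoe"))
    print(residents_of("House3"))
```

Note that neither the person nor the house record knows anything about the other. Every question about who lived where has to go through the domicile records, which is exactly the extra hop that makes queries against this model more verbose.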
One of the powerful aspects of Turtle is that it is much easier to showcase a referential approach, especially in examples. For instance, the above model can be illustrated as follows:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix Class: <http://www.example.com/ns/Class#>.

# Persons (i.e., People)
@prefix Persons: <http://www.example.com/ns/Persons#>.
Persons:_JaneDoe a Class:_Persons.
Persons:_JohnSmith a Class:_Persons.
Persons:_EleanorRigby a Class:_Persons.

# Cities
@prefix Cities: <http://www.example.com/ns/Cities#>.
Cities:_BostonMA a Class:_Cities.
Cities:_NewYorkNY a Class:_Cities.
Cities:_SeattleWA a Class:_Cities.

# Houses
@prefix Houses: <http://www.example.com/ns/Houses#>.
Houses:_House1 a Class:_Houses;
    Houses:hasCity Cities:_BostonMA;
    .
Houses:_House2 a Class:_Houses;
    Houses:hasCity Cities:_NewYorkNY;
    .
Houses:_House3 a Class:_Houses;
    Houses:hasCity Cities:_SeattleWA;
    .

# Domiciles
@prefix Domiciles: <http://www.example.com/ns/Domiciles#>.
Domiciles:_Domicile1 a Class:_Domiciles;
    Domiciles:hasPerson Persons:_JaneDoe;
    Domiciles:hasHouse Houses:_House1;
    Domiciles:hasStartDate "2003-01"^^xsd:gYearMonth;
    Domiciles:hasEndDate "2008-10"^^xsd:gYearMonth;
    .
Domiciles:_Domicile2 a Class:_Domiciles;
    Domiciles:hasPerson Persons:_JaneDoe;
    Domiciles:hasHouse Houses:_House2;
    Domiciles:hasStartDate "2008-11"^^xsd:gYearMonth;
    Domiciles:hasEndDate "2014-06"^^xsd:gYearMonth;
    .
Domiciles:_Domicile3 a Class:_Domiciles;
    Domiciles:hasPerson Persons:_JaneDoe;
    Domiciles:hasHouse Houses:_House3;
    Domiciles:hasStartDate "2014-07"^^xsd:gYearMonth;
    .
Domiciles:_Domicile4 a Class:_Domiciles;
    Domiciles:hasPerson Persons:_JohnSmith;
    Domiciles:hasHouse Houses:_House3;
    Domiciles:hasStartDate "2011-04"^^xsd:gYearMonth;
    Domiciles:hasEndDate "2014-06"^^xsd:gYearMonth;
    .
Domiciles:_Domicile5 a Class:_Domiciles;
    Domiciles:hasPerson Persons:_EleanorRigby;
    Domiciles:hasHouse Houses:_House3;
    Domiciles:hasStartDate "2002-02"^^xsd:gYearMonth;
    Domiciles:hasEndDate "2011-03"^^xsd:gYearMonth;
    .
I’ve stripped out labels and irrelevant metadata, though you should assume their existence. The format is also a bit different in that the namespaces are introduced with each respective class listing rather than as part of a preamble. While not conventional, this is legal Turtle (prefix declarations may appear anywhere between statements), and it helps to get across the idea that the namespaces are tied into their respective classes. Finally, I’ve bucked convention a bit here by using the plural form of classes to identify them in namespaces and prefixes. Again, this serves to emphasize that classes are sets.
As an aside, there are a number of conventions that exist within the RDF space that are given because they worked with a particular tool or approach, but that don’t actually offer any functionality or purpose. In some cases, they can even be deceptive because they suggest functionality that doesn’t exist. If it serves you better, be willing to break with these conventions, so long as you declare that you are doing so. It may suggest ways of modeling that may not be immediately obvious otherwise.
The above example provides data within a working (albeit tiny) ontology. What it doesn’t do is actually declare the relationships that exist. This is a good point to explore SHACL and how it can be used to do just that.
A Brief Introduction to SHACL
When I started this article, my intent was to focus on SHACL, but the problem of many-to-many relationships was on my mind at the time. Not surprisingly, what comes out of the fingertips of writers doesn’t always reflect their conscious intent.
SHACL is the Shapes Constraint Language, and was developed in the 2010s largely through the efforts of TopQuadrant. It was intended primarily as a lighter weight language to describe relationships that exist between various nodes within a graph, and its closest analog would be the XML Schema Definition Language (XSD) developed about 15 years earlier. Unlike OWL, it isn’t specifically a language intended to describe logical relationships, but rather a lower level language intended primarily to work on graphs at a more granular level.
As an example, you could use OWL to create a relationship indicating that a person was domiciled in a given house at a certain time. Domiciled-ness (the property of having lived in a given house during a given period) is in essence a shape. It is a pattern that can be described and validated against, without the requirement that this formally describe a class. This is arguably a very subtle distinction, and there are many in the community who would argue that you could do the same thing with OWL.
SHACL comes into its own in two ways. First (and arguably most important), SHACL came about after the introduction of SPARQL, whereas OWL predated it. Indeed, SPARQL can be thought of as a standardization and simplification of OWL rules, turning it into both a set selection and set construction language. SPARQL has its own validation capabilities in the ASK command, which returns a boolean indicating whether the pattern is satisfied:
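PREFIX Class: <http://www.example.com/ns/Class#>
PREFIX Domiciles: <http://www.example.com/ns/Domiciles#>

ASK {
    ?person a Class:_Persons.
    ?house a Class:_Houses.
    ?domicile a Class:_Domiciles.
    ?domicile Domiciles:hasPerson ?person.
    ?domicile Domiciles:hasHouse ?house.
}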
If ?person and ?house are both already bound, this will return true if the ?house was one in which the person lived at some time in the past, and false otherwise.
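To show what that boolean actually tells you, here is a rough Python sketch of the same pattern evaluated against a handful of triples held in a set. It is an illustration of the ASK semantics under the assumption that both ?person and ?house are supplied by the caller, not how a SPARQL engine is implemented; the function name ask_lived_in is hypothetical, and only a subset of the article’s data is included.

```python
RDF_TYPE = "rdf:type"

# A small subset of the example data, written as (subject, predicate, object).
TRIPLES = {
    ("Persons:_JaneDoe", RDF_TYPE, "Class:_Persons"),
    ("Persons:_JohnSmith", RDF_TYPE, "Class:_Persons"),
    ("Houses:_House1", RDF_TYPE, "Class:_Houses"),
    ("Houses:_House3", RDF_TYPE, "Class:_Houses"),
    ("Domiciles:_Domicile1", RDF_TYPE, "Class:_Domiciles"),
    ("Domiciles:_Domicile1", "Domiciles:hasPerson", "Persons:_JaneDoe"),
    ("Domiciles:_Domicile1", "Domiciles:hasHouse", "Houses:_House1"),
    ("Domiciles:_Domicile4", RDF_TYPE, "Class:_Domiciles"),
    ("Domiciles:_Domicile4", "Domiciles:hasPerson", "Persons:_JohnSmith"),
    ("Domiciles:_Domicile4", "Domiciles:hasHouse", "Houses:_House3"),
}


def ask_lived_in(person: str, house: str) -> bool:
    """True if some domicile links this person to this house, else False."""
    # The first two patterns: both resources must have the expected type.
    if (person, RDF_TYPE, "Class:_Persons") not in TRIPLES:
        return False
    if (house, RDF_TYPE, "Class:_Houses") not in TRIPLES:
        return False
    # The remaining patterns: find any ?domicile that satisfies all of them.
    domiciles = {s for (s, p, o) in TRIPLES
                 if p == RDF_TYPE and o == "Class:_Domiciles"}
    return any(
        (d, "Domiciles:hasPerson", person) in TRIPLES
        and (d, "Domiciles:hasHouse", house) in TRIPLES
        for d in domiciles
    )


if __name__ == "__main__":
    print(ask_lived_in("Persons:_JaneDoe", "Houses:_House1"))
    print(ask_lived_in("Persons:_JohnSmith", "Houses:_House1"))
```

A false result here could mean the person is unknown, the house is unknown, or no domicile connects the two. The function, like ASK, collapses all three into one answer.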
This is the second area where SHACL shines. While I can build logic based upon this kind of switch, what is often more helpful is understanding why a query may have failed. This is something that OWL was never really intended to do, and something that can be done in SPARQL but is awkward or cumbersome to do, especially since there may in fact be several reasons for failure. SHACL can be used to construct validation reports, which are graphs that contain information describing the multiple points of failure of an invalid node. While this can aid significantly in debugging, it can also be a critical component in the creation of workflows.
SHACL as Schema
The basis of SHACL is a set of triples which identify the shapes themselves. In general, the shapes they identify are likely to be classes or properties, though there is no specific requirement that they have to be. They make use of the SHACL namespace and language. The Domiciles and Houses classes both have properties, but Domiciles is sufficiently complex to be interesting:
@prefix sh: <http://www.w3.org/ns/shacl#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix Class: <http://www.example.com/ns/Class#>.
@prefix Domiciles: <http://www.example.com/ns/Domiciles#>.

# This defines the class shape.
Class:_Domiciles a sh:NodeShape, owl:Class;
    sh:targetClass Class:_Domiciles;
    sh:closed true;
    # A closed shape rejects rdf:type as well unless it is explicitly ignored.
    sh:ignoredProperties (rdf:type);
    sh:message "Properties are identified that have not yet been declared.";
    sh:property Domiciles:hasPerson,
        Domiciles:hasHouse,
        Domiciles:hasStartDate,
        Domiciles:hasStartDate_YearMonth,
        Domiciles:hasStartDate_Cardinality,
        Domiciles:hasStartDate_BeforeEndDate,
        Domiciles:hasEndDate;
    .
Domiciles:hasPerson a sh:PropertyShape, rdf:Property;
    sh:path Domiciles:hasPerson;
    sh:name "person";
    sh:description "This is the tenant (Person) who lives in the domicile during a particular period."^^xsd:string;
    sh:nodeKind sh:IRI;
    sh:class Class:_Persons;
    sh:minCount 1;
    sh:message "A domicile must have one or more associated persons."^^xsd:string;
    .
Domiciles:hasHouse a sh:PropertyShape, rdf:Property;
    sh:path Domiciles:hasHouse;
    sh:name "house";
    sh:description "This is the house that a tenant lived in."^^xsd:string;
    sh:nodeKind sh:IRI;
    sh:class Class:_Houses;
    sh:minCount 1;
    sh:maxCount 1;
    sh:message "A domicile must have one and only one associated house."^^xsd:string;
    .
Domiciles:hasStartDate a sh:PropertyShape, rdf:Property;
    sh:path Domiciles:hasStartDate;
    sh:name "startDate";
    sh:description "This is the date that a given tenant (person) moved into a house, given as a YYYY-MM formatted string."^^xsd:string;
    sh:nodeKind sh:Literal;
    .
Domiciles:hasStartDate_YearMonth a sh:PropertyShape;
    sh:path Domiciles:hasStartDate;
    sh:nodeKind sh:Literal;
    sh:datatype xsd:gYearMonth;
    sh:message "Start dates must be in the gYearMonth format."^^xsd:string;
    .
Domiciles:hasStartDate_Cardinality a sh:PropertyShape;
    sh:path Domiciles:hasStartDate;
    sh:nodeKind sh:Literal;
    sh:minCount 1;
    sh:maxCount 1;
    sh:message "There can be one and only one start date."^^xsd:string;
    .
# Core SHACL offers sh:lessThan but no sh:greaterThan, so the ordering
# is stated from the start date's side.
Domiciles:hasStartDate_BeforeEndDate a sh:PropertyShape;
    sh:path Domiciles:hasStartDate;
    sh:lessThan Domiciles:hasEndDate;
    sh:message "End date must be more recent than start date."^^xsd:string;
    .
Domiciles:hasEndDate a sh:PropertyShape, rdf:Property;
    sh:path Domiciles:hasEndDate;
    sh:name "endDate";
    sh:nodeKind sh:Literal;
    sh:minCount 0;
    sh:maxCount 1;
    .
Shape Nodes
The node shape definition usually corresponds to a class definition or something roughly similar, though it doesn’t have to. The subject of the node shape can be a class, but can also be an open-ended shape or even a blank node. When it is a class node, it usually makes sense to define the class characteristics at the same time (i.e., specifying the type of the node via rdf:type, or the special predicate “a”, which means rdf:type). The sh:targetClass predicate can also be used to specify the class, especially when the shape applies to more than one class.
The sh:closed predicate indicates whether the shape follows the closed-world assumption (the property definitions listed are the only ones that may appear on a node) or the open-world assumption (there could be other properties or constraints). OWL makes no distinction, since everything there is open-world, but most relational databases are clearly closed in nature. The primary purpose of this is to allow for a message in the case that the assumption is broken.
The sh:message predicate contains a string that gets returned as part of a report any time a given test fails, and is context sensitive. The string is normally static, and if it is not included then the system usually provides a default message given the types of errors that have occurred.
Property Nodes
Property nodes describe and augment properties, and correspond fairly closely to element and attribute declarations in XSD. Property nodes are typically bound to shape nodes via the sh:property predicate, though this isn’t strictly necessary (a property node can be declared independently, which means that the property nodes are treated as global properties). In the example above, the property nodes Domiciles:hasPerson and Domiciles:hasHouse are in fact predicates bound to the Class:_Domiciles class.
Start date is a bit more problematic, though it does illustrate the distinction between properties and property shapes:
Domiciles:hasStartDate a sh:PropertyShape, rdf:Property;
    sh:path Domiciles:hasStartDate;
    sh:name "startDate";
    sh:description "This is the date that a given tenant (person) moved into a house, given as a YYYY-MM formatted string."^^xsd:string;
    sh:nodeKind sh:Literal;
    .
Domiciles:hasStartDate_YearMonth a sh:PropertyShape;
    sh:path Domiciles:hasStartDate;
    sh:nodeKind sh:Literal;
    sh:datatype xsd:gYearMonth;
    sh:message "Start dates must be in the gYearMonth format."^^xsd:string;
    .
Domiciles:hasStartDate_Cardinality a sh:PropertyShape;
    sh:path Domiciles:hasStartDate;
    sh:nodeKind sh:Literal;
    sh:minCount 1;
    sh:maxCount 1;
    sh:message "There can be one and only one start date."^^xsd:string;
    .
# Core SHACL offers sh:lessThan but no sh:greaterThan, so the ordering
# is stated from the start date's side.
Domiciles:hasStartDate_BeforeEndDate a sh:PropertyShape;
    sh:path Domiciles:hasStartDate;
    sh:lessThan Domiciles:hasEndDate;
    sh:message "End date must be more recent than start date."^^xsd:string;
    .
Domiciles:hasEndDate a sh:PropertyShape, rdf:Property;
    sh:path Domiciles:hasEndDate;
    sh:name "endDate";
    sh:nodeKind sh:Literal;
    sh:minCount 0;
    sh:maxCount 1;
    .
The Domiciles:hasStartDate property is a bit unusual in that there is a distinction between what is a class property (Domiciles:hasStartDate) and what is a shape property (Domiciles:hasStartDate_YearMonth and Domiciles:hasStartDate_Cardinality). The first identifies the property node and its associated path, and would be used to infer that Domiciles:hasStartDate is a property of Class:_Domiciles. The latter two, on the other hand, test specific constraint conditions (such as cardinality or the way that a date is represented). They still represent the same property, because they have the same path.
The sh:path property is one of the more misunderstood (and important) properties in SHACL. It is what differentiates a property shape (which has one) from a node shape (which does not). The property path identifies the value nodes that the constraints apply to. Most of the time, these will be the objects of the property predicate itself (here, Domiciles:hasStartDate), but it is possible (especially when a constraint is not synonymous with a predicate) for it to represent the members of an RDF list, the objects of a transitive closure (e.g., skos:narrower*, which SHACL writes as [ sh:zeroOrMorePath skos:narrower ]), or a substructure reached through a blank node. This becomes especially significant when SHACL is used in conjunction with the sh:sparql constraint, which I won’t go into in detail here.
Property shapes can also reference other properties of the same focus node. For instance, suppose that you want to guarantee that the end date, if it exists, is later than the start date. Core SHACL has no sh:greaterThan, but you can state the same constraint from the other side by attaching sh:lessThan Domiciles:hasEndDate to a property shape whose path is Domiciles:hasStartDate. This contextual awareness can considerably expand what you can do with SHACL, as you can make multiple constraints in this manner.
I’ve also incorporated the sh:name property here. It, along with sh:description, provides a degree of documentation, and is also used to identify and describe property names that may be independent of the corresponding predicates.
Validating with SHACL and SHACL Reports
There are a number of different ways that vendors and open source providers who support SHACL (including Jena, TopQuadrant, AllegroGraph, Stardog and others) have implemented it. The most common is to create a graph that holds all of the SHACL library within your knowledge graph, then pass one or more nodes into a validate(shacl_graph, ...testNodes) function that can be called by a host language, such as within SPARQL.
The SHACL specification does not define external interfaces, only what a SHACL report looks like. For instance, taking the shapes above, suppose that you had the following domicile node:
Domiciles:_Domicile8 a Class:_Domiciles;
    # No person has been identified for the domicile.
    # House4 has not been defined yet.
    Domiciles:hasHouse Houses:_House4;
    # Start date is of the wrong format.
    Domiciles:hasStartDate "2021-03-04"^^xsd:date;
    # End date as given occurs before start date.
    Domiciles:hasEndDate "2020-05-11"^^xsd:date;
    .
If the SHACL shapes were contained in the graph Graphs:_SHACL, then the function ex:validate(Graphs:_SHACL, Domiciles:_Domicile8) would return the following “report” graph:
[ a sh:ValidationReport ;
    sh:conforms false ;
    sh:result
        [ a sh:ValidationResult ;
            sh:focusNode Domiciles:_Domicile8 ;
            sh:resultMessage "A domicile must have one or more associated persons." ;
            sh:resultPath Domiciles:hasPerson ;
            sh:resultSeverity sh:Violation ;
            sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
            sh:sourceShape Domiciles:hasPerson
        ],
        [ a sh:ValidationResult ;
            sh:focusNode Domiciles:_Domicile8 ;
            sh:resultMessage "A domicile must have one and only one associated house." ;
            sh:resultPath Domiciles:hasHouse ;
            sh:value Houses:_House4 ;
            sh:resultSeverity sh:Violation ;
            sh:sourceConstraintComponent sh:ClassConstraintComponent ;
            sh:sourceShape Domiciles:hasHouse
        ],
        [ a sh:ValidationResult ;
            sh:focusNode Domiciles:_Domicile8 ;
            sh:resultMessage "Start dates must be in the gYearMonth format." ;
            sh:resultPath Domiciles:hasStartDate ;
            sh:value "2021-03-04"^^xsd:date ;
            sh:resultSeverity sh:Violation ;
            sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
            sh:sourceShape Domiciles:hasStartDate_YearMonth
        ],
        [ a sh:ValidationResult ;
            sh:focusNode Domiciles:_Domicile8 ;
            sh:resultMessage "End date must be more recent than start date." ;
            sh:resultPath Domiciles:hasStartDate ;
            sh:value "2021-03-04"^^xsd:date ;
            sh:resultSeverity sh:Violation ;
            sh:sourceConstraintComponent sh:LessThanConstraintComponent ;
            sh:sourceShape Domiciles:hasStartDate_BeforeEndDate
        ]
] .
What makes this so impressive is that you can get multiple errors caught at the same time. If you pass a data graph rather than a single resource node, then this would identify all of the errors of everything within the graph that has an associated SHACL validator. This can also be transformed into XML or JSON and fed into client side components to display information accordingly in a dashboard or similar interface.
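The idea of collecting every failure rather than stopping at the first one can be sketched in a few lines of Python. This is emphatically not a SHACL engine: it only mimics cardinality checks and the closed-shape check for a single node, and the names Result, CARDINALITY and validate, along with the end-date message, are hypothetical stand-ins invented for the illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Result:
    """A much-reduced stand-in for one validation result."""
    focus_node: str
    path: str
    message: str


# path -> (min count, max count or None for unbounded, message)
CARDINALITY: Dict[str, Tuple[int, Optional[int], str]] = {
    "Domiciles:hasPerson": (1, None, "A domicile must have one or more associated persons."),
    "Domiciles:hasHouse": (1, 1, "A domicile must have one and only one associated house."),
    "Domiciles:hasStartDate": (1, 1, "There can be one and only one start date."),
    "Domiciles:hasEndDate": (0, 1, "There can be at most one end date."),  # hypothetical message
}


def validate(focus_node: str, properties: Dict[str, List[str]]) -> List[Result]:
    """Check every rule and return all failures, not just the first."""
    results = []
    for path, (low, high, message) in CARDINALITY.items():
        count = len(properties.get(path, []))
        if count < low or (high is not None and count > high):
            results.append(Result(focus_node, path, message))
    # Closed-shape check: any property that was never declared is a failure.
    for path in properties:
        if path not in CARDINALITY:
            results.append(Result(
                focus_node, path,
                "Properties are identified that have not yet been declared."))
    return results


if __name__ == "__main__":
    node = {
        "Domiciles:hasHouse": ["Houses:_House1", "Houses:_House2"],
        "Domiciles:hasStartDate": ["2021-03"],
        "Domiciles:nickname": ["Rainy"],  # hypothetical undeclared property
    }
    for result in validate("Domiciles:_Domicile9", node):
        print(result.path, "-", result.message)
```

Because the function returns a list rather than raising on the first problem, a caller can decide what to do with each failure separately, which is the property that makes reports useful for dashboards and workflows.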
Summary
Next up, I hope to look at how SHACL can be used to help drive interfaces, before turning to SHACL based functions and extensions in Javascript, SPARQL and elsewhere.
Kurt Cagle is the editor of The Cagle Report.