My first knowledge graph

I've been trying to learn about knowledge graphs for a while. Learning about "knowledge graphs" is a lot like "learning SQL" or "learning Lisp"; it's a standard, a convention, an idea. I think I've made some progress recently, and so here's my attempt to write something I wish I'd had at the start.

My interest in this subject stems from sharing data. In that context, I heard about knowledge graphs through the concept of "linked data". I don't want to go into all that too much just yet because one of the biggest issues I struggled with was which concepts/standards/acronyms I needed to focus on. Instead, I want to start with the barest notion of what knowledge graphs are, how to define some simple data in this system, and how to query that data.

I think a reasonable enough definition of "knowledge graph" is any data structure constructed using RDF (Resource Description Framework) or any data structure that can be queried through SPARQL. It's as simple and frustratingly vague as that. It is the start of our descent into the special madness that thrives in the seam between complex standards and their implementations.

defining data with triples

Knowledge graphs are a list of RDF statements; an RDF statement is a triple; a triple has components subject, object, and predicate; subjects, objects, and predicates are RDF terms.

"Knowledge graphs" is the subject, "a list of RDF statements" is the object, and "are" is the predicate; "an RDF statement" is the subject, "triple" is the object, and "is a" is the predicate; "triple" is the subject this time, "a subject, an object, and a predicate" is the object, and "has components" is the predicate; "subjects, objects, and predicates" are the subjects, "term" is the object, and "are" is the predicate.

This is how all information is represented in a knowledge graph. Across all usages of RDF and the different related standards, a graph will always contain a set of triples. For this reason, "triple store" is a common term to describe whatever system actually stores all of this stuff. There are many different ways of representing the subject, object, and predicate and the "graph" part of "knowledge graph" is defined by the choices made in representing all those statements. I work better if I have a concrete example or system to interact with and the way I've been keeping it concrete so far is by using Ontotext GraphDB to construct simple graphs. I think it's been a decent way to explore the gap between specification and implementation while also having solid ground to walk on.
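To make the "set of triples" idea concrete without any particular triple store, here's a minimal sketch in plain Python. It is not how GraphDB or any RDF library stores data. The tuple layout, the plain strings standing in for RDF terms, and the helper name `objects_of` are all assumptions made for illustration, and the tuples use the conventional subject, predicate, object ordering.

```python
# A toy "triple store": a set of (subject, predicate, object) tuples.
# Plain strings stand in for RDF terms here; real RDF distinguishes
# IRIs, literals, and blank nodes.
graph = {
    ("knowledge-graph", "uses", "RDF"),
    ("RDF", "is a", "list of statements"),
    ("statement", "is a", "triple"),
    ("subject", "is a", "term"),
    ("object", "is a", "term"),
    ("predicate", "is a", "term"),
}


def objects_of(graph, subject, predicate):
    """Return every object connected to `subject` by `predicate`."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}


# A set ignores duplicates: adding the same statement twice changes nothing.
graph.add(("statement", "is a", "triple"))

print(len(graph))                              # 6
print(objects_of(graph, "statement", "is a"))  # {'triple'}
```

Using a set rather than a list mirrors the fact that stating the same triple twice doesn't add any new information to the graph.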

In short, it's the job of RDF to provide a mechanism for defining connections; following those connections is SPARQL's job. Before I get to that, though, I think it's worth taking a look at some different text representations of triples.

reading RDF statements

I created a graph using the statements above. Here's what our knowledge graph about knowledge graphs looks like in a format called Turtle:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix g: <my:> .

g:RDF a "list of statements" .

g:statement a g:triple .

g:subject a _:genid-0d0e16066f2f4f8facb63f3a8ebbd544148815-term .

g:object a _:genid-0d0e16066f2f4f8facb63f3a8ebbd544148815-term .

g:predicate a _:genid-0d0e16066f2f4f8facb63f3a8ebbd544148815-term .

_:node7 rdf:first g:subject;
  rdf:rest _:node8 .

_:node8 rdf:first g:object;
  rdf:rest _:node9 .

_:node9 rdf:first g:predicate;
  rdf:rest rdf:nil .

g:knowledge-graph g:uses g:RDF .

g:triple g:has _:node7 .

I picked Turtle because I think it's one of the better formats for getting a high-level feel for the graph just by reading the data. Still, we're confronted with a lot of new details. Even a few simple sentences in a relatively readable format introduce a mix of built-in RDF concepts and informal conventions. I'll highlight a few things in the file that were confusing to me when I first started learning RDF.

First, I can see the relevant terms from our sentences but almost all of them have extra stuff around them. This is because the nodes are formatted depending on their node type. We have three such types in this graph. The nodes that start with g: and rdf: are IRIs. I think these only need to be unique within any particular graph. However, since one of the use cases of RDF is to link data across the entire World Wide Web, these IRIs often end up becoming globally unique identifiers. This is often done by labelling nodes with URLs, some of which point to valid resources on the web and some of which are purely symbolic. It's confusing! This is also why we have those two @prefix lines at the beginning. That's a Turtle convention to make reading triples a little easier. Another node type is the literal value. In this case we just have one: the string "list of statements". The last major node type in this graph is a "blank" node. These are the terms that start with _:. It's a way of saying that the node represents something but we don't really need it to represent anything concrete. There's also the predicate a, which is a built-in label for the "is a" predicate and ultimately resolves to an IRI. It occurs often enough to justify the shortcut.
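As a rough way to check the three node types against the Turtle above, here's a small Python sketch that classifies a term by how it is written. The function name `term_kind` is hypothetical, and the rules are deliberately simplified: they only cover the forms that appear in this post, not the full Turtle grammar.

```python
def term_kind(term: str) -> str:
    """Classify a term from the Turtle above by its surface syntax.

    Simplified: handles only the forms that appear in this post.
    """
    if term.startswith("_:"):
        return "blank node"
    if term.startswith('"'):
        return "literal"
    if term == "a" or term.startswith("<") or ":" in term:
        return "IRI"  # the `a` shortcut, a <full-iri>, or a prefix:name
    raise ValueError(f"unrecognised term: {term!r}")


for term in ["g:subject", "rdf:first", '"list of statements"', "_:node7", "a"]:
    print(f"{term:24} -> {term_kind(term)}")
```

The blank node and literal checks come first on purpose: `_:node7` also contains a colon, so testing for a prefix before testing for `_:` would misclassify it as an IRI.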

There are a couple other node categories, but I think those can wait. They're RDF terms defined using other RDF concepts and the recursion really did a number on my brain when I first read about them.

The other main thing that catches my eye is that our sentence "a triple has components subject, object, and predicate" looks pretty chopped up. The topic of formally describing our nodes deserves a dedicated post, but basically what happened is that I used some shorthand for describing a collection of things as the object of that triple. Since we had some information about subjects, objects, and predicates (they are all terms), I wanted to make sure those relationships could be represented as well. What my shorthand translated to is an RDF version of a linked list (no relation to linked data). The final triple in our graph, g:triple g:has _:node7, doesn't refer to the entire collection as one node. Instead, it points to the blank node _:node7, which is the first cell of the list: that cell's rdf:first is the first element, g:subject, and its rdf:rest relationship leads to the cell holding the next element in our list.
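The linked-list structure is easier to see by walking it. Here's a minimal Python sketch that follows rdf:first and rdf:rest from the head cell until it reaches rdf:nil. The triples are copied from the Turtle above with prefixes expanded; the helper names `one_object` and `read_collection` are made up for this example, and the sketch assumes a well-formed list with no cycles.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The list-related triples from the Turtle above, with prefixes expanded.
triples = {
    ("my:triple", "my:has", "_:node7"),
    ("_:node7", RDF + "first", "my:subject"),
    ("_:node7", RDF + "rest", "_:node8"),
    ("_:node8", RDF + "first", "my:object"),
    ("_:node8", RDF + "rest", "_:node9"),
    ("_:node9", RDF + "first", "my:predicate"),
    ("_:node9", RDF + "rest", RDF + "nil"),
}


def one_object(triples, subject, predicate):
    """Return the single object for (subject, predicate), or raise."""
    matches = [o for (s, p, o) in triples if s == subject and p == predicate]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, got {len(matches)}")
    return matches[0]


def read_collection(triples, head):
    """Follow rdf:first / rdf:rest from `head` until rdf:nil."""
    items = []
    node = head
    while node != RDF + "nil":
        items.append(one_object(triples, node, RDF + "first"))
        node = one_object(triples, node, RDF + "rest")
    return items


head = one_object(triples, "my:triple", "my:has")
print(head)                            # _:node7
print(read_collection(triples, head))  # ['my:subject', 'my:object', 'my:predicate']
```

Note that the order of the elements lives entirely in the rdf:rest chain. The triples themselves sit in an unordered set.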

The last thing I want to note is the formatting of "term". Why's it like that? Well, the short answer is because I made "term" a blank node for no reason. I thought about making it a normal node but after looking at a different format I decided to keep it in. I think it highlights the arbitrariness of blank node labels more so than "node7" and company.

Speaking of other formats, Turtle is just one of thirteen formats that GraphDB exports out of the box. Let's look at one more:

<my:RDF> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "list of statements" <my:blog-post> .
<my:statement> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <my:triple> <my:blog-post> .
<my:subject> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> _:genid2d0d0e16066f2f4f8facb63f3a8ebbd5441488152dterm <my:blog-post> .
<my:object> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> _:genid2d0d0e16066f2f4f8facb63f3a8ebbd5441488152dterm <my:blog-post> .
<my:predicate> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> _:genid2d0d0e16066f2f4f8facb63f3a8ebbd5441488152dterm <my:blog-post> .
_:node7 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> <my:subject> <my:blog-post> .
_:node8 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> <my:object> <my:blog-post> .
_:node9 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> <my:predicate> <my:blog-post> .
_:node7 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> _:node8 <my:blog-post> .
_:node8 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> _:node9 <my:blog-post> .
_:node9 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> <my:blog-post> .
<my:knowledge-graph> <my:uses> <my:RDF> <my:blog-post> .
<my:triple> <my:has> _:node7 <my:blog-post> .

This notation is called N-Quads. Each line is a triple plus the name of the graph that triple belongs to. I like including the graph name because I'm not sure if it's more common for the data to specify the graph name or if that's left to be decided at import time. Either way, it seems nice to include a handle for the data and further insulate it from getting mangled in a global graph context.

Unlike Turtle, N-Quads includes the full IRI in each node rather than keeping a prefix at the top of the file. This includes the predicate a, which we now see is a stand-in for the IRI http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
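The relationship between the two formats can be sketched as a prefix expansion. The Python below turns one prefixed statement into an N-Quads line. The prefix table comes from the @prefix lines in the Turtle export; the function names `expand` and `to_nquad` are hypothetical, and the expansion ignores most of the real grammar (typed literals, escapes, and so on).

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "g": "my:",
}
RDF_TYPE = PREFIXES["rdf"] + "type"


def expand(term: str) -> str:
    """Expand one Turtle-style term into its N-Quads form (simplified)."""
    if term == "a":
        return f"<{RDF_TYPE}>"  # the `a` shortcut stands for rdf:type
    if term.startswith("_:") or term.startswith('"'):
        return term  # blank nodes and plain literals pass through unchanged
    prefix, _, local = term.partition(":")
    return f"<{PREFIXES[prefix]}{local}>"


def to_nquad(s: str, p: str, o: str, graph: str) -> str:
    """Join subject, predicate, object, and graph name into one line."""
    return " ".join(expand(t) for t in (s, p, o, graph)) + " ."


print(to_nquad("g:statement", "a", "g:triple", "g:blog-post"))
# <my:statement> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <my:triple> <my:blog-post> .
```

The printed line matches the second line of the N-Quads export above, which is a decent sanity check that the two files describe the same statement.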

The other thing I noticed is that the blank node label for "term" has a different format. I exported this from the same software and at the same time as the Turtle export, and yet even between these formats the label for our blank node representing "term" is different. Just a little reminder that we might not want to rely on blank nodes for anything we'd want to preserve across systems.

a note on URLs

I mentioned this earlier but I think it's worth repeating. Some URLs point to valid resources on the web, but some are purely symbolic. I think the intention is that any URL should point to a valid resource, but since ownership of URLs is maintained by a different system the contents could change without us noticing. Link rot can spread from our web pages to our knowledge graphs.

querying data with SPARQL

I think of RDF as defining the shape of our data. SPARQL exists to actually work on that data. There are a few high-level operations, including the familiar insert and select and delete. For example, here's how I created the graph from the Turtle and N-Quads files above:

prefix m: <my:>
insert data {
    graph m:blog-post {
        m:knowledge-graph m:uses m:RDF.
        m:RDF a "list of statements".
        m:statement a m:triple.
        m:triple m:has (m:subject m:object m:predicate).
        m:subject a _:term.
        m:object a _:term.
        m:predicate a _:term.
    }
}

Here we have a representation similar to Turtle: we've got our prefixes and our a predicates. We're also specifying that we want to add this to a graph named <my:blog-post> (which is what m:blog-post ultimately expands to). I think this is required by GraphDB despite being listed as optional in the SPARQL spec. Ultimately, though, this reminds me a lot of SQL insert statements: we specify a destination for our new data and then declare that data below. Especially after reading the Turtle data above I feel like I have a good sense of what's going on here.

One area where the SQL mental model breaks down is in SPARQL's where clauses. Let's consider this query to find out what a subject is:

PREFIX m: <my:>
select ?object where {
    m:subject a ?object .
} limit 100

What's interesting to me is that this looks similar to our insert query. I'm writing out a triple and putting in a variable wherever I want to capture the result. Instead of reminding me of SQL, though, this reminds me of logic programming. I don't have much experience with logic programming but my impression is that, in general, one declares the shape of the data they're looking for and asks the query engine to find all values that could complete the shape. There are a few other types of query beyond select, too. One of them is ask and its description in GraphDB documentation is even closer to logic programming:
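The "declare a shape, let the engine fill in the blanks" idea can be sketched in a few lines of Python. This is only an illustration of matching a single triple pattern, not how a SPARQL engine actually works: the `match` function is hypothetical, strings starting with `?` play the role of variables, and `_:term` is treated as an ordinary fixed value here.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = {
    ("my:statement", RDF_TYPE, "my:triple"),
    ("my:subject", RDF_TYPE, "_:term"),
    ("my:object", RDF_TYPE, "_:term"),
    ("my:knowledge-graph", "my:uses", "my:RDF"),
}


def match(pattern, triples):
    """Yield one {variable: value} binding per triple that fits `pattern`.

    Pattern positions starting with "?" are variables; anything else must
    equal the triple's term exactly.
    """
    for triple in triples:
        bindings = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if bindings.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            # No position mismatched, so this triple is a solution.
            yield bindings


# Roughly: select ?object where { m:subject a ?object }
print(list(match(("my:subject", RDF_TYPE, "?object"), triples)))
# [{'?object': '_:term'}]
```

Moving the variable to a different position asks a different question of the same data. For example, the pattern `("?s", RDF_TYPE, "_:term")` finds everything that is a term.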

ASK — returns “YES” if the query has a solution, otherwise “NO”

The key word for me here is solution. Our queries are, fundamentally, questions about our data and those questions may or may not have answers. Here's an example of an ask query:

PREFIX m: <my:>
ask {
    m:subject a ?object .
} limit 100

In this query we're asking our graph if it has any triples where the subject is m:subject and the predicate is a. The query engine then tries to find any triples matching that structure. Instead of returning any matching objects, we just care that a match exists so it returns a boolean. We do have a triple of that form so this query returns YES. It may seem like a simple feature but I think it's a good example of SPARQL responding to what kinds of questions people have and meeting them where they're at. I don't have to write a query and check if the result set is empty and then convert that observation into a boolean. The specification designers understand what I'm actually trying to find out.
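In the same spirit, an ask query can be thought of as "does any triple fit this shape?". Here's a minimal Python sketch under that assumption. The `ask` function is hypothetical, `None` stands in for a variable we don't care to capture, and the data is a small subset of the graph above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = {
    ("my:statement", RDF_TYPE, "my:triple"),
    ("my:subject", RDF_TYPE, "_:term"),
    ("my:knowledge-graph", "my:uses", "my:RDF"),
}


def ask(pattern, triples):
    """True if at least one triple fits `pattern`; None acts as a wildcard."""
    return any(
        all(want is None or want == have for want, have in zip(pattern, triple))
        for triple in triples
    )


# Roughly: ask { m:subject a ?object }
print(ask(("my:subject", RDF_TYPE, None), triples))   # True
print(ask(("my:subject", "my:uses", None), triples))  # False
```

Because `any` stops at the first match, the sketch never builds a result set just to check whether it is empty, which is the convenience the ask form offers.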

next steps

There is a lot more to knowledge graphs than just these few examples. There are ways to use RDF to describe RDF. There are ways to link graphs to other graphs. I'll be exploring these concepts further by trying to describe my music library. I'll try to give shape to the next round of these amorphous concepts by writing out some facts for one of my favorite albums, Stereolab's Dots and Loops.

other relevant sources