20 Aug

Exposing Resources in Datomic Using Linked Data

Financial data feeds from various data providers tend to be closed off from most people due to high costs, licensing agreements, obscure documentation, and complicated business logic. The problem of understanding this data and providing access to it from our applications is something that we (and many others) have had to solve over and over again. Recently at Pellucid we were faced with three concrete problems:

  1. Adding a new data set to make data visualizations with. This was a high-dimensional data set, and we were certain that the queries needed to build the charts would have to be highly parameterizable.

  2. We were starting to come to terms with the difficulty of answering support questions about the data we use in our charts, given that we were serving it up using a Finagle service that spoke a binary protocol over TCP. Support staff should not have to learn Datomic’s highly expressive query language, Datalog, or set up a Scala console just to look at the raw data being served.

  3. Different data sets that we use contained semantically equivalent data that was being accessed in ways specific to each data set.

And as a long-term goal we wanted to be able to query across data sets instead of doing multiple queries and joining in memory.

These are quite orthogonal goals, to be sure. We embarked on a project which we thought might move us in all three directions simultaneously. We’d already ingested the data set from the raw file format into Datomic, which we love. Goal 2 was easily addressable by conveying data over a more accessible protocol, and what’s more accessible than REST? Goal 1 meant that we’d have to expose quite a bit of Datalog’s expressivity to be able to write all the queries we needed. And Goal 3 hinted at the need for some way to talk about things in different data silos using a common vocabulary. Enter the Linked Data Platform, a W3C project, the need for which is brilliantly covered in this talk. What’s the connection? Wait for it…

The RDF Datomic Mapping

Let’s do a little bit of a technical deep dive for a second and consider the RDF format, the language of linked data or, as its wiki page puts it, “a standard model for data interchange”. If you haven’t come across it yet, RDF is basically a representation of a graph as a set of triples. In the vocabulary of RDF, a triple is a Subject, Predicate and Object. Conveniently, Datomic, in addition to being many other things, is also a triple store; only it calls the triples facts, and their components Entity, Attribute and Value. There is a natural mapping between Datomic facts and RDF triples. Entities, identified by unique Long values, map into unique Subject and Object URLs. Attribute keywords map into Predicate URLs. The primitive Values that Datomic supports can be represented quite easily with RDF’s data types. The other important property that falls out of this mapping is that the “where” clause of a Datalog query, also being a set of triples, can quite naturally be represented in RDF with some encoding of variables.
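To make the mapping concrete, here is a minimal sketch in Clojure. The base URL and helper names are illustrative, and deriving the entity type from the attribute’s namespace is a simplifying assumption for this sketch, not necessarily how Datomic or our service does it:

(def base "https://data.pellucid.com")

(defn entity-uri
  "Map a Datomic entity id (a Long) to a Subject/Object URL."
  [type eid]
  (str base "/" type "/" eid))

(defn attribute-uri
  "Map a Datomic attribute keyword to a Predicate URL."
  [attr]
  (str base "/ontology/" (namespace attr) "#" (name attr)))

(defn fact->triple
  "Turn a Datomic fact [e a v] into an RDF triple. Primitive values
   pass through as literals; ref values would go through entity-uri too."
  [[e a v]]
  [(entity-uri (namespace a) e) (attribute-uri a) v])

(fact->triple [123 :owner/orientation "Active"])
;; => ["https://data.pellucid.com/owner/123"
;;     "https://data.pellucid.com/ontology/owner#orientation"
;;     "Active"]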

This brought us close to Goal 2, but not quite all the way there. Hand waving over a lot of the details, it should be easy to intuit that, with this mapping defined somewhere, we could quite easily represent a Datalog query, and the results of that query, in RDF. So that’s what we did, sort of. We came up with a slightly more compact representation of the query:

@prefix ldp: <http://www.w3.org/ns/ldp#> .
@prefix owner: <https://data.pellucid.com/ontology/owner#> .
@prefix invStyle: <https://data.pellucid.com/ontology/investmentStyle#> .

<> ldp:match (
    [ ldp:path (owner:ID) ]
    [ ldp:path (owner:styleRef invStyle:description) ; ldp:value "Growth" ]
) .

<> ldp:projection [ ldp:path (owner:styleRef invStyle:description) ] ,
                  [ ldp:path (owner:orientation) ] .

The above is a Turtle serialisation of the query represented as RDF data. Ignoring the representation for a second, this query essentially pulls the “investing orientation” and “investment style” description from a database of investment managers, or “owners”, and is equivalent to the SQL query:

SELECT o.ID, s.description, o.orientation
FROM owners o, investmentStyles s
WHERE
  o.ID IS NOT NULL AND
  o.investmentStyleID = s.ID AND
  s.description = 'Growth'

The result would be:

In SQL:

o.ID | o.orientation | s.description
------------------------------------
123  | "Active"      | "Growth"
789  | "Passive"     | "Growth"
…

In RDF, using relative URIs as defined in LDP:

</owner/123> </ontology/owner/ID> 123 .
</owner/123> </ontology/owner/orientation> "Active" .
</owner/123> </ontology/owner/investmentStyleRef> </investmentStyle/456> .
</investmentStyle/456> </ontology/investmentStyle/description> "Growth" .

</owner/789> </ontology/owner/ID> 789 .
</owner/789> </ontology/owner/orientation> "Passive" .
</owner/789> </ontology/owner/investmentStyleRef> </investmentStyle/987> .
</investmentStyle/987> </ontology/investmentStyle/description> "Growth" .

Query Compilation

As for how: the query is executed by POSTing it to the owners container (/owner/). A container in the Linked Data Platform is analogous to the root URL of a resource in familiar REST semantics and conceptually represents a table in a data silo. So what we expect to get in response are owner entities and the entities they are related to. (As a side note, one of the reasons RDF is preferable to JSON as a data interchange format is that it represents a graph, a more general data structure than JSON’s tree.)
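For illustration, the round trip from a Clojure REPL might look something like this, assuming the clj-http client library; the file name is hypothetical and none of this is prescribed by LDP itself:

(require '[clj-http.client :as http])

;; `query` is the Turtle document from the previous section.
(def query (slurp "owner-query.ttl"))

(def response
  (http/post "https://data.pellucid.com/owner/"   ; the owners container
             {:body    query
              :headers {"Content-Type" "text/turtle"
                        "Accept"       "text/turtle"}}))

(:body response) ;; => the result graph, serialised as Turtle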

The ldp:match section in the above query gets compiled down into the “where” clauses of a Datalog query for all owners that have an ID attribute with any value and a styleRef/description attribute equal to “Growth”. (The parentheses around the matches indicate that this is a list of matches rather than a set; the ordering can be tweaked to put the most restrictive matches first, allowing the client to optimise the query.) This corresponds to the “where” clause in SQL as well. However, since Datalog queries are actual Clojure data structures rather than strings, they’re easier to build up than SQL queries.
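To make that concrete, here is roughly the Datalog that the ldp:match section above would compile down to. This is a sketch rather than our exact output; the attribute idents (:owner/ID, :owner/styleRef, :investmentStyle/description) are illustrative stand-ins for the real schema:

(require '[datomic.api :as d])

;; Each ldp:path step becomes one "where" triple; a multi-step path
;; joins through an intermediate variable (?style below).
(def compiled-query
  '[:find ?owner
    :where
    [?owner :owner/ID _]                              ; owner:ID with any value
    [?owner :owner/styleRef ?style]                   ; follow the ref...
    [?style :investmentStyle/description "Growth"]])  ; ...and constrain the leaf

;; Assuming an open peer connection `conn`:
(d/q compiled-query (d/db conn)) ;; => #{[123] [789] …}, the matched entity ids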

Now, for the resulting owner entities, the ldp:projection section contains directives for the related resources that you want inlined in the result set. If no projections are specified, just the list of owner URLs is returned. So with these two sections you can construct any simple Datalog query that does not involve functions or aggregations. Converting the results of the projection to an RDF graph is fairly trivial, since Datomic facts are already triples.
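As a rough sketch of that conversion, assuming the Datomic peer API, walking one projection path for a matched owner could look like the following; the facts it emits can then be mapped through a fact->triple helper like the one sketched earlier. The helper name and the lack of error handling are ours, not the actual implementation:

(defn project-path
  "Follow a path of attribute keywords (one ldp:path) from an entity,
   emitting one [e a v] fact per step. Assumes each attribute is present."
  [db eid path]
  (loop [eid eid, [attr & more] path, facts []]
    (let [v (get (d/entity db eid) attr)
          v (if (instance? datomic.Entity v) (:db/id v) v)]
      (if more
        (recur v more (conj facts [eid attr v]))
        (conj facts [eid attr v])))))

;; One projected path for owner 123, assuming connection `conn`:
(project-path (d/db conn) 123 [:owner/styleRef :investmentStyle/description])
;; => [[123 :owner/styleRef 456]
;;     [456 :investmentStyle/description "Growth"]]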

Are we there yet?

Coming back to our initial goals, we can now send flexible queries and receive human-readable (e.g. HTML) or machine-readable (e.g. Turtle) entity representations over HTTP, which covers Goals 1 and 2. At this point we’ve actually already addressed Goal 3 as well. Notice the owner:styleRef. It’s shorthand for a URL that points to a definition in a global ontology. This way, if we ever get another data set that has information about the investment style of an owner, we can reuse this predicate and map it to the corresponding Attribute in that data set’s silo, as sketched below.
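In the simplest case, that mapping could be a per-silo table from the shared predicate to each data set’s own attribute ident. Everything here is illustrative; neither the silo names nor the attribute idents come from our actual schemas:

;; Queries written against the shared owner:styleRef predicate resolve
;; to whichever attribute a given silo actually stores the data under.
(def silo->attribute
  {:silo-a {:owner/styleRef :a.owner/investmentStyleID}
   :silo-b {:owner/styleRef :b.manager/styleCode}})

(get-in silo->attribute [:silo-b :owner/styleRef])
;; => :b.manager/styleCode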

We have made all sorts of extensions to the query language to allow a wider set of possible Datalog queries, including aggregations and inequality constraints, which merit their own post. And since the queries are essentially declarative, there is a lot of scope for optimising them by taking advantage of Datomic’s laziness.

Next steps include addressing that long-term goal of querying across silos, for which I think we would probably go with another component of the Semantic Web stack: SPARQL. Another potential improvement is exposing the basisT of a Datomic database value as an HTTP ETag. And finally, a really far-reaching vision for this project, which borrowed so much from the web, is to give back to it in the form of our ontology. We believe that a shared ontology of financial data could be very beneficial to many and open up the normally closed-off world of handling financial data.

Bio

Ratan Sebastian is a software engineer at Pellucid Analytics. He works on our data team, specifically on linked data and closely with Datomic. He earned a degree in industrial engineering at IIT, but jumped on board the functional programming bandwagon at school and has become a Scala aficionado.