Thursday, September 19, 2013

Hybrid RDF Graphs (or Hybrid ABoxes, however you want to see it ;-) )

This is an exciting new feature that lets you combine virtual RDF (mappings) with real RDF (or, virtual ABoxes with ABox assertions). This is a unique feature of ontop: in other systems, either you have mappings and everything is about SPARQL to SQL, or you have triples and a triple store.

With hybrid RDF graphs you can have an ontology with axioms and data as follows (in Turtle syntax):

Axiomatic triples
:ceoOf rdfs:domain        :CEO .
:CEO   rdfs:subClassOf    :BusinessMan .
:ceoOf rdfs:subPropertyOf :worksFor .

Data triples
:Bill_Gates :ceoOf :Microsoft .

Mappings
:person/{ID} :knows :Bill_Gates
SELECT ID FROM tbl_microsoft_employees


Note how the mapping states that all people created from IDs in tbl_microsoft_employees know Bill Gates. Bill Gates is a sort of "global" individual. Moreover, we also know some things about Bill Gates, i.e., that he is the CEO of Microsoft. And we know some things about the business world, i.e., that the domain of ceoOf is CEO, that a CEO is a kind of BusinessMan, and that being the CEO of a company is one way of working for that company.

Now we can execute queries like the following and get the answers that we expect:

SELECT ?x ?y WHERE {
   ?x :knows ?y. ?y a :BusinessMan ; :worksFor :Microsoft 
}

As always, ontop will translate this SPARQL query into an SQL query, and in this particular case the query will look something like this:

SELECT CONCAT(':person/', ID) AS x, ':Bill_Gates' AS y
FROM tbl_microsoft_employees

Notice that there is a lot going on here; this is not just query translation. Reasoning took place, involving all the axioms in the ontology, the data triples and the mappings. In the end, we arrive at the simple, efficient query that we would write manually, which will give you great performance even in the presence of large volumes of data.
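
To make the reasoning step a bit more concrete, here is a small sketch in plain Python that saturates the data triple with the three axioms from the example. This is only an illustration of which facts follow from the ontology. It is not how ontop works internally (ontop does this during query rewriting, not by materializing facts), and the function name saturate is made up for this post.

```python
# Illustrative only: a tiny forward-chaining saturation in plain Python.
# It is NOT ontop's implementation; it just shows which facts follow
# from the axioms and the data triple in the example.

AXIOMS = [
    (":ceoOf", "rdfs:domain", ":CEO"),
    (":CEO", "rdfs:subClassOf", ":BusinessMan"),
    (":ceoOf", "rdfs:subPropertyOf", ":worksFor"),
]
DATA = [(":Bill_Gates", ":ceoOf", ":Microsoft")]


def saturate(axioms, data):
    """Apply domain, subClassOf and subPropertyOf until nothing new appears."""
    facts = set(data)
    changed = True
    while changed:
        changed = False
        for a_s, a_p, a_o in axioms:
            derived = set()
            for s, p, o in facts:
                if a_p == "rdfs:domain" and p == a_s:
                    derived.add((s, "rdf:type", a_o))
                elif a_p == "rdfs:subClassOf" and p == "rdf:type" and o == a_s:
                    derived.add((s, "rdf:type", a_o))
                elif a_p == "rdfs:subPropertyOf" and p == a_s:
                    derived.add((s, a_o, o))
            if not derived <= facts:
                facts |= derived
                changed = True
    return facts


facts = saturate(AXIOMS, DATA)

# Individuals that are a :BusinessMan and work for :Microsoft,
# i.e. the possible bindings for ?y in the SPARQL query.
ys = sorted(
    s
    for s, p, o in facts
    if (p, o) == ("rdf:type", ":BusinessMan")
    and (s, ":worksFor", ":Microsoft") in facts
)
print(ys)  # [':Bill_Gates']
```

Since :Bill_Gates is the only possible value for ?y, the only part of the query left for the database is the :knows mapping, which is why the final SQL touches just tbl_microsoft_employees.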

Why use hybrid RDF graphs?

This functionality is useful when you have large volumes of data that would be inefficient to translate into RDF and that you want to keep in the original RDBMS, but at the same time you have some (not so large volume of) data that you want to use during query answering. The smaller dataset is either too small to bother inserting into the RDBMS and writing mappings for it, or it simply belongs in the ontology, i.e., it is domain knowledge, not application data.

Limitations

This functionality is available only for class and object property assertions. That is, you may not have data triples with literal values like:

:Bill_Gates :age "57"^^xsd:integer
:Bill_Gates :name "William Henry Gates"

Performance

Using hybrid RDF graphs may slow down the query rewriting process. The system deals with RDF triples as if they were mappings that require nothing from the DB. That means that all those facts are considered during SQL generation, and having too many of them may slow down query translation.

Free variables: In particular, query rewriting may become slow for queries that have "free classes" or "free properties" in the graph patterns, for example:

SELECT ?x ?p WHERE { ?x ?p :mariano }

or

SELECT ?x ?c WHERE { ?x rdf:type ?c }

If you are experiencing slow query rewriting because of this, try to avoid having these "free" patterns in isolation. Use them only if there is a "non-free" section of the query with which you can JOIN them. This will restrict the query and will limit the facts that are involved in answering your query, making everything faster. For example:

SELECT ?x ?c 
WHERE {?x :hasFather ?y. ?x :hasAge ?z. ?x rdf:type ?c }
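
As a rough intuition for why the join helps, here is a toy model in plain Python. The facts are made up for this example, everything lives in memory (in a real hybrid setup some patterns would be answered by mappings), and the helper assumes both patterns share the same subject variable. It is a sketch of the idea, not of ontop's rewriting algorithm.

```python
# Hypothetical data triples, invented for this illustration.
FACTS = [
    (":mariano", "rdf:type", ":Person"),
    (":mariano", ":hasFather", ":carlos"),
    (":acme", "rdf:type", ":Company"),
    (":acme", ":locatedIn", ":bolzano"),
    (":bolzano", "rdf:type", ":City"),
]


def is_var(term):
    """SPARQL-style variables start with '?'."""
    return term.startswith("?")


def matching_facts(pattern, facts):
    """Facts that agree with the pattern on every non-variable position."""
    return [
        f for f in facts
        if all(is_var(t) or t == v for t, v in zip(pattern, f))
    ]


def restricted(free_pattern, anchor_pattern, facts):
    """Matches of free_pattern whose subject also matches anchor_pattern.

    Assumes both patterns use the same variable in subject position.
    """
    subjects = {s for s, _, _ in matching_facts(anchor_pattern, facts)}
    return [f for f in matching_facts(free_pattern, facts) if f[0] in subjects]


free = ("?x", "rdf:type", "?c")
anchor = ("?x", ":hasFather", "?y")

print(len(matching_facts(free, FACTS)))      # 3: every typed individual
print(len(restricted(free, anchor, FACTS)))  # 1: only :mariano has a father
```

In isolation the free pattern drags in every class assertion; once it is joined with a non-free pattern, only the facts about matching individuals matter.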

JOIN order: At the moment, make sure that any triple patterns in SPARQL that are related to data triples are at the end of the query, especially those with free predicates. For example, this is not good:

SELECT ?x ?c 
WHERE { ?x rdf:type ?c . ?x :hasFather ?y. ?x :hasAge ?z. }

but this is good:

SELECT ?x ?c 
WHERE {?x :hasFather ?y. ?x :hasAge ?z. ?x rdf:type ?c }


A good join order is one in which the more "restricted" triple patterns come first. For virtual RDF graphs (pure mappings) this doesn't matter, but for hybrid graphs it might matter a lot. In the future we hope to improve this, but for the moment you should take it into account.
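
If you generate your SPARQL queries programmatically, a simple way to follow this advice is to sort the triple patterns before writing the query. The sketch below is one possible heuristic in plain Python, not something ontop does for you: patterns with a free predicate or a free class go last, and among the rest, patterns with fewer variables go first. The function names are made up for this post.

```python
def is_var(term):
    """SPARQL-style variables start with '?'."""
    return term.startswith("?")


def restriction_key(pattern):
    """Sort key: 'free' patterns last, then fewer variables first."""
    s, p, o = pattern
    free_predicate = is_var(p)
    free_class = p == "rdf:type" and is_var(o)
    n_vars = sum(is_var(t) for t in pattern)
    return (free_predicate or free_class, n_vars)


def reorder(patterns):
    """Return the patterns with the more restricted ones first.

    sorted() is stable, so patterns with equal keys keep their order.
    """
    return sorted(patterns, key=restriction_key)


bad_order = [
    ("?x", "rdf:type", "?c"),
    ("?x", ":hasFather", "?y"),
    ("?x", ":hasAge", "?z"),
]

good_order = reorder(bad_order)
print(" . ".join(" ".join(p) for p in good_order))
# ?x :hasFather ?y . ?x :hasAge ?z . ?x rdf:type ?c
```

Applied to the "not good" query above, this produces the same pattern order as the "good" one.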

Number of data triples: The number of facts (data triples) will affect the performance of query rewriting. How much is "too big", and when query rewriting may become slow, depends on your memory, your machine, the SPARQL query and how much the ABox interacts with the TBox. However, the current implementation should allow for a few thousand ABox assertions on normal hardware.


Give this kind of modelling a try and let us know how it goes!
