14.3. IRI Dereferencing
There are many cases when RDF data should be retrieved from remote sources only when really needed.
E.g., a scheduling application may read personal calendars from personal sites of its users.
Calendar data expire quickly, so there is no reason to reload them frequently in the hope that they will be queried before they expire.
Virtuoso extends SPARQL so that it can download an RDF resource from a given IRI, parse it, and store the resulting triples in a graph; all three operations are performed during SPARQL query execution.
The IRI of the graph that stores the triples is usually equal to the IRI from which the resource is downloaded, so the feature is named "IRI dereferencing".
There are two different use cases for this feature.
In the simple case, a SPARQL query contains from clauses that enumerate the graphs to process, but there are no triples in DB.DBA.RDF_QUAD that correspond to some of these graphs.
The query execution starts with dereferencing of these graphs and the rest runs as usual.
In the more sophisticated case, the query is executed many times in a loop.
Every execution produces a partial result.
The SPARQL processor checks the result for IRIs of resources that may contain relevant data but are not yet loaded into DB.DBA.RDF_QUAD, and loads them.
Eventually the partial result of an iteration is identical to the result of the previous iteration, because there is no more data to retrieve.
As the last step, the SPARQL processor builds the final result set.
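The loop described above can be modelled outside the database. The following Python sketch is only an illustration under stated assumptions: the REMOTE dictionary stands in for remote resources, the "SeeAlso" predicate marks IRIs worth loading, and the function names are hypothetical. It is not how Virtuoso implements the feature.

```python
# Hypothetical in-memory stand-in for remote resources: IRI -> list of triples.
REMOTE = {
    "http://myhost/user1.ttl": [("user1", "SeeAlso", "http://myhost/user2.ttl")],
    "http://myhost/user2.ttl": [("user2", "SeeAlso", "http://myhost/user3.ttl")],
    "http://myhost/user3.ttl": [("user3", "FullName", "User Three")],
}

def run_query(store):
    """Partial result: every triple currently held in the local store."""
    return {triple for triples in store.values() for triple in triples}

def iterate_to_fixed_point(root_iri, max_iterations=10):
    """Re-run the query until two consecutive partial results are identical."""
    store = {root_iri: REMOTE.get(root_iri, [])}  # graphs loaded so far
    previous = None
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        result = run_query(store)
        if result == previous:
            break  # nothing new was retrieved, so the result is final
        previous = result
        # Load every referenced resource that is not in the store yet.
        for _subject, predicate, obj in result:
            if predicate == "SeeAlso" and obj not in store:
                store[obj] = REMOTE.get(obj, [])
    return previous, iterations

final_result, iterations_used = iterate_to_fixed_point("http://myhost/user1.ttl")
```

In this model the last iteration retrieves nothing and only confirms that the result has stopped changing.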
14.3.1. IRI Dereferencing For FROM Clauses, "define get:..." Pragmas
Virtuoso extends the SPARQL syntax of the from and from named clauses.
An additional list of options is allowed at the end of the clause: option ( param1 value1, param2 value2, ... )
where parameter names are QNames that start with the get: prefix and values are "precode" expressions, i.e. expressions that do not contain variables other than external parameters.
Names of allowed parameters are listed below.
-
get:soft is the retrieval mode; supported values are "soft" and "replacing".
If the value is "soft", the SPARQL processor will not even try to retrieve triples if the destination graph is non-empty.
Other get:... parameters have no effect without this one.
-
get:uri is the IRI to retrieve if it is not equal to the IRI of the from clause.
This can be used if data should be retrieved from a mirror rather than from the original resource location, or in any other case where the destination graph IRI differs from the location of the resource.
-
get:method is the HTTP method that should be used to retrieve the resource; supported methods are "GET" for plain HTTP and "MGET" for a URIQA web service endpoint.
By default, "MGET" is used for IRIs that end with "/" and "GET" for everything else.
-
get:refresh is the maximum allowed age of the cached resource, no matter what is specified by the server where the resource resides.
The value is a positive integer (number of seconds). Virtuoso reads the HTTP headers and uses the "Date", "ETag", "Expires", "Last-Modified", "Cache-Control" and "Pragma: no-cache" fields to calculate when the resource should be reloaded; get:refresh can make this value smaller but can never increase it.
-
get:proxy is the address of the proxy server, as a "host:port" string, for cases when direct download is impossible; the default is to use no proxy.
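Two of the rules in the list above are small enough to state as code: the default choice of get:method and the way get:refresh caps the cache lifetime. The Python sketch below only restates those two rules; the function names are hypothetical, and server_max_age is assumed to be the number of seconds already derived from the HTTP headers.

```python
def default_method(iri):
    """Default get:method: "MGET" for IRIs that end with "/", "GET" otherwise."""
    return "MGET" if iri.endswith("/") else "GET"

def effective_max_age(server_max_age, refresh=None):
    """Cache lifetime in seconds; get:refresh may shorten it but never extend it."""
    if refresh is None:
        return server_max_age  # no get:refresh given: server headers decide alone
    return min(server_max_age, refresh)
```

For example, a resource that the server allows to be cached for an hour is reloaded after ten minutes when get:refresh is 600, while a get:refresh of 7200 leaves the one-hour lifetime unchanged.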
If the value of some get:... parameter repeats for every from clause, it can be written as a global pragma like define get:soft "soft".
The following two queries will work identically:
sparql
select ?id
from named <http://myhost/user1.ttl>
option (get:soft "soft", get:method "GET")
from named <http://myhost/user2.ttl>
option (get:soft "soft", get:method "GET")
where { graph ?g { ?id a ?o } };
sparql
define get:method "GET"
define get:soft "soft"
select ?id
from named <http://myhost/user1.ttl>
from named <http://myhost/user2.ttl>
where { graph ?g { ?id a ?o } };
This can make the text shorter, and it is especially useful when the query text comes from a client but the parameter should have a fixed value for security reasons:
the values set by define get:... can not be redefined inside the query, and the application may prepend the desired pragmas to the text before execution.
Note that the user should have the SPARQL_UPDATE role in order to execute such a query.
By default, the SPARQL web service endpoint is owned by the SPARQL user, which has SPARQL_SELECT but not SPARQL_UPDATE.
It is possible in principle to grant SPARQL_UPDATE to SPARQL, but this compromises the security of the whole RDF storage.
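The practice of prepending pragmas to client-supplied query text, mentioned above, could look like the following Python sketch on the application side. The helper name and the FIXED_PRAGMAS table are hypothetical; the sketch assumes that pragma values contain no double quotes and that the client text does not already start with the sparql keyword.

```python
# Hypothetical application-side policy: pragmas the client must not control.
FIXED_PRAGMAS = {"get:soft": "soft", "get:method": "GET"}

def with_fixed_pragmas(client_query, pragmas=FIXED_PRAGMAS):
    """Return the query text with one "define" line per fixed pragma in front."""
    lines = ['define {} "{}"'.format(name, value) for name, value in pragmas.items()]
    return "\n".join(lines + [client_query])

query = with_fixed_pragmas(
    "select ?id from named <http://myhost/user1.ttl> where { graph ?g { ?id a ?o } }"
)
```

Because values set by define get:... can not be redefined inside the query, the client text that follows the prepended lines can not loosen these settings.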
14.3.2. IRI Dereferencing For Variables, "define input:grab-..." Pragmas
Consider a set of personal data in which one resource can list many persons and point to resources where those persons are described in more detail.
E.g., the resource about user1 describes that user and also contains statements that user2 and user3 are persons and that more data can be found in user2.ttl and user3.ttl;
user3.ttl can contain statements that user4 is also a person and that more data can be found in user4.ttl, and so on.
The query should find as many users as possible and return their names and e-mails.
If all data about all users were loaded into the database, the query could be quite simple:
sparql select ?id ?fullname ?email
where {
graph ?g {
?id a <Person> ;
<FullName> ?fullname ;
<EMail> ?email .
} };
It is possible to enable IRI dereferencing in such a way that all appropriate resources are loaded during the query execution even if names of some of them are not known a priori.
sparql
define input:grab-var "?more"
define input:grab-depth 10
define input:grab-limit 100
define input:grab-base "http://myhost/"
select ?id ?fullname ?email
where {
graph ?g {
?id a <Person> ;
<FullName> ?fullname ;
<EMail> ?email .
optional { ?id <SeeAlso> ?more } } };
The IRI dereferencing is controlled by the following pragmas:
-
input:grab-var specifies the name of a variable whose values should be used as IRIs of resources that should be downloaded.
It is not an error if the variable is sometimes unbound or gets values that can not be converted to IRIs (e.g., integers); bad values are silently ignored.
It is also not an error if the IRI can not be retrieved; this makes IRI retrieval somewhat similar to "best effort union" in SQL.
This pragma can be used more than once to specify many variable names.
It is not an error if values of different variables result in the same IRI or if a variable gets the same value many times; no IRI is retrieved more than once.
-
input:grab-iri specifies an IRI that should be retrieved before executing the rest of the query, if it is not already in DB.DBA.RDF_QUAD.
This pragma can be used more than once to specify many IRIs.
The typical use of this pragma is querying a set of related resources when only one "root" resource IRI is known and even that resource is not loaded.
-
input:grab-all is the simplest possible way to enable the feature, but the resulting performance can be very bad.
It turns all variables and IRI constants in all graph, subject and object fields of all triple patterns of the query into values for input:grab-var and input:grab-iri,
so the SPARQL processor will dereference everything that might be related to the text of the query.
-
input:grab-seealso specifies the IRI of a predicate similar to foaf:seeAlso.
Predicates of that sort suggest the location of resources that contain more data about the subject of the predicate.
The IRI dereferencing routine may use these predicates to find additional IRIs for loading resources.
This is especially useful when the text of the query comes from a remote client and may lack triple patterns like optional { ?id <SeeAlso> ?more } from the previous example.
The use of input:grab-seealso makes the SPARQL query nondeterministic, because the order and the number of retrieved documents depend on the execution plan and may change from run to run.
This pragma can be used more than once to specify many IRIs, but this feature is costly.
Every additional predicate may result in a significant number of lookups in the RDF storage, affecting the total execution time.
-
input:grab-limit should be an integer that is the maximum allowed number of resource retrievals.
The default value is very large (a few million documents), so setting a smaller value is strongly recommended.
Set it even if you are absolutely sure that the set of resources is small, because program errors are always possible.
All resource downloads are counted, both successful and failed, both forced by input:grab-iri and forced by input:grab-var.
Nevertheless, all constant IRIs specified by input:grab-iri (or input:grab-all) are downloaded before the first check of the input:grab-limit counter,
so this limit will never prevent the downloading of "root" resources.
-
input:grab-depth should be an integer that is the maximum allowed number of query iterations.
Every iteration may find new IRIs to retrieve, because resources loaded on the previous iteration may add these IRIs to DB.DBA.RDF_QUAD and make the result set longer.
The default value is 1, so the SPARQL processor will retrieve only resources explicitly named in "root" resources or in quads that are already in the database before the query is executed.
-
input:grab-base specifies the base IRI used to convert relative IRIs into absolute ones. The default is an empty string.
-
input:grab-resolver is the name of a procedure that resolves IRIs and determines the HTTP method of retrieval.
The default is the name of the DB.DBA.RDF_GRAB_RESOLVER_DEFAULT() procedure, which is described below.
If another procedure is specified, its signature should match that of the default one.
-
input:grab-destination overrides the default behaviour of IRI dereferencing and stores all retrieved triples in a single graph.
This is convenient when it does not matter where any given triple comes from, and changes in remote resources will only add triples but not make cached triples obsolete.
A SPARQL query is usually faster when all graph IRIs are fixed and there are no graph group patterns with an unbound graph variable, so storing everything in one single graph is worth considering.
-
input:grab-loader is the name of a procedure that retrieves the resource via HTTP, parses it and stores it.
The default is the name of the DB.DBA.RDF_SPONGE_UP() procedure; this procedure is also used by IRI dereferencing for FROM clauses.
You will probably never need to write your own procedure of this sort, but some Virtuoso plugins will provide ready-to-use functions that retrieve non-RDF resources and extract their metadata as triples, or that implement protocols other than HTTP.
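The interplay of input:grab-iri, input:grab-limit and input:grab-depth described in the list above can be summarised with a small model. The Python sketch below is an interpretation of that description, not Virtuoso code: the function name is hypothetical, and candidate_batches is an assumed input holding the IRIs that each query iteration would discover through input:grab-var variables.

```python
def plan_downloads(root_iris, candidate_batches, grab_limit, grab_depth=1):
    """Return, in order, the IRIs that would be retrieved under this model."""
    attempted = []   # every attempt counts, whether it succeeds or fails
    seen = set()     # no IRI is retrieved more than once
    for iri in root_iris:
        # Roots named by input:grab-iri are fetched before the first limit check.
        if iri not in seen:
            seen.add(iri)
            attempted.append(iri)
    for batch in candidate_batches[:grab_depth]:  # one batch per query iteration
        for iri in batch:
            if iri in seen:
                continue
            if len(attempted) >= grab_limit:
                return attempted  # retrieval budget is exhausted
            seen.add(iri)
            attempted.append(iri)
    return attempted
```

With one root, a limit of 2 and candidates "a", "b", "a" in the first iteration, only the root and "a" are retrieved; with three roots and a limit of 1, all three roots are still retrieved but nothing else is.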
The default resolver procedure is DB.DBA.RDF_GRAB_RESOLVER_DEFAULT(). Note that the function produces two absolute URIs,
abs_uri and dest_uri. The default procedure returns two equal strings, but other procedures may return different values,
e.g., return the primary and permanent location of the resource as dest_uri and the fastest known mirror location as
abs_uri, thus saving HTTP retrieval time. A resolver can even signal an error to block the downloading of some unwanted resource.
DB.DBA.RDF_GRAB_RESOLVER_DEFAULT (
in base varchar, -- base IRI as specified by input:grab-base pragma
in rel_uri varchar, -- IRI of the resource as it is specified by input:grab-iri or a value of a variable
out abs_uri varchar, -- the absolute IRI that should be downloaded
out dest_uri varchar, -- the graph IRI where triples should be stored after download
out get_method varchar ) -- the HTTP method to use, should be "GET" or "MGET".
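To make the contract of a custom resolver more concrete, the following Python sketch models the same inputs and outputs outside the database. It is not Virtuoso/PL and is not the default procedure: the MIRRORS and BLOCKED_PREFIXES tables and the host names in them are hypothetical, and the sketch always answers "GET" for simplicity.

```python
from urllib.parse import urljoin

# Hypothetical policy tables, invented for this illustration only.
MIRRORS = {"http://myhost/": "http://mirror.example/myhost/"}
BLOCKED_PREFIXES = ("http://untrusted.example/",)

def resolve(base, rel_uri):
    """Return (abs_uri, dest_uri, get_method) for one IRI to dereference."""
    # dest_uri: permanent location, used as the IRI of the destination graph.
    dest_uri = urljoin(base, rel_uri)
    if dest_uri.startswith(BLOCKED_PREFIXES):
        # Signalling an error blocks the download of an unwanted resource.
        raise ValueError("download blocked: " + dest_uri)
    # abs_uri: where to download from; defaults to the permanent location.
    abs_uri = dest_uri
    for origin, mirror in MIRRORS.items():
        if dest_uri.startswith(origin):
            abs_uri = mirror + dest_uri[len(origin):]
            break
    return abs_uri, dest_uri, "GET"
```

Under these assumptions, resolve("http://myhost/", "user2.ttl") stores triples in the graph http://myhost/user2.ttl while downloading them from the mirror, which is the split between dest_uri and abs_uri described above.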