www.openlinksw.com
docs.openlinksw.com

Book Home

Contents
Preface

RDF Database and SPARQL

Data Representation
IRI_ID Type RDF_QUAD and other tables Short, Long and SQL Values Special Cases and XML Schema Compatibility SQL Compiler Support - QUIETCAST option
RDF and SPARQL API and SQL
IRI Dereferencing
RDF Views -- Mapping Relational Data to RDF
SPARQL Implementation
RDF Inference in Virtuoso
Using Full Text Search in SPARQL
Aggregates in SPARQL
Virtuoso SPARQL Query Service

14.1. Data Representation

This section covers how Virtuoso stores RDF triples. The IRI_ID built-in data type is introduced, along with the default table structures used for triple persistency.

14.1.1. IRI_ID Type

The central notion of RDF is the IRI, or URI, which serves as the globally unique label of named nodes. The subject and predicate of a triple are always IRI's and the object may be an IRI or any other XML Schema scalar data type. In any case, an IRI is always distinct from any instance of any other data type.

Virtuoso supports a native IRI_ID data type, internally an unsigned 32 bit integer value. When compared with other IRI's, the collation is as with unsigned 32 bit integers. An IRI_ID is never equal to any instance of any other type.

Thus, the object column of a table storing triples can be declared as ANY and IRI values will be distinguishable without recourse to any extra flag and IRI's will naturally occupy their own contiguous segment in the ANY type collation sequence. Indices can be defined over such columns. An IRI_ID is never automatically cast into any other type nor any other type into IRI_ID.

The functions iri_id_num (in i IRI_ID) and iri_id_from_num (in n INT) convert between signed 32 bit integers and IRI_ID's. The function isiri_id (in i any) returns nonzero if the argument is of type IRI_ID, zero otherwise.

The syntax for an IRI_ID literal is

#i<nnn>, where nnn is up to 10 decimal digits.
#i12345 is equal to iri_id_from_num (12345)

When received by a SQL client application, the ODBC driver or interactive SQL will bind an IRI_ID to a character buffer, producing the #innn syntax. When passing IRI_ID's from a client, one can pass an integer and use the iri_id_from_num () function in the statement to convert server side. A SQL client will normally not be exposed to IRI_ID's since the SPARQL implementation returns IRI's in their text form, not as internal id's. These will however be seen if reading the internal tables directly.


14.1.2. RDF_QUAD and other tables

The main tables of the default RDF storage system are:

create table DB.DBA.RDF_QUAD (
  G IRI_ID,
  S IRI_ID,
  P IRI_ID,
  O any,
  primary key (G,S,P,O) );
create bitmap index RDF_QUAD_OGPS on DB.DBA.RDF_QUAD (O, G, P, S);

Each triple is represented by one row in RDF_QUAD. The columns represent the graph, subject, predicate and object. The IRI_ID type columns reference RDF_IRI, which translates the internal id to the external name of the IRI. The O column is of type ANY. If the O value is a non-string SQL scalar, such as a number or date or IRI, it is stored in its native binary representation. If it is a "very short" string (20 characters or less), it is also stored "as is". Long strings and RDF literal with non-default type or language are stored using special data type called "rdf_box". Instance of rdf_box consists of data type, language, the content (or beginning characters of a long content). and a possible reference to RDF_OBJ if the object is too long to be held in-line in this table or should be outlined for free-text indexing.

create table DB.DBA.RDF_PREFIX (
  RP_NAME varchar primary key,
  RP_ID int not null unique );
create table DB.DBA.RDF_IRI (
  RI_NAME varchar primary key,
  RI_ID IRI_ID not null unique );

These two tables store a mapping between internal IRI id's and their external string form. A memory-resident cache contains recently used IRIs to reduce access to this table. Function id_to_iri (in id IRI_ID) returns the IRI by its ID. Function iri_to_id (in iri varchar, in may_create_new_id) returns an IRI_ID for given string; if the string is not used before as an IRI then either NULL is returned or a new ID is allocated, depending on the second argument.

create table DB.DBA.RDF_OBJ (
  RO_ID integeR primary key,
    RO_VAL varchar,
  RO_LONG long varchar,
  RO_DIGEST any
)
create index RO_VAL on DB.DBA.RDF_OBJ (RO_VAL)
create index RO_DIGEST on DB.DBA.RDF_OBJ (RO_DIGEST)
;

When an O value of RDF_QUAD is longer than a certain limit or should be free-text indexed, the value is stored in this table. Depending on the length of the value, it goes into the varchar or the long varchar column. The RO_ID is contained in rdf_box object that is stored in the O column. Still, the truncated value of O can be used for determining equality and range matching, even even if < and > of closely matching values need to look at the real string in RDF_OBJ. When RO_LONG is used to store very long value, RO_VAL contains a simple checksum of the value, to accelerate search for identical values when the table is populated by new values.

create table DB.DBA.RDF_DATATYPE (
    RDT_IID IRI_ID not null primary key,
    RDT_TWOBYTE integer not null unique,
    RDT_QNAME varchar );

The XML Schema data type of a typed string O represented as 2 bytes in the O varchar value. This table maps this into the broader IRI space where the type URI is given an IRI number.

create table DB.DBA.RDF_LANGUAGE (
    RL_ID varchar not null primary key,
    RL_TWOBYTE integer not null unique );

The varchar representation of a O which is a string with language has a two byte field for language. This table maps the short integer language id to the real language name such as 'en', 'en-US' or 'x-any'.

Note that unlike datatype names, language names are not URIs.

A short integer value can be used in both RDF_DATATYPE and RDF_LANGUAGE tables for two different purposes. E.g. an integer 257 is for 'unspecified datatype' as well as for 'unspecified language'.


14.1.3. Short, Long and SQL Values

When processing an O, the SPARQL implementation may have it in one of three internal formats, called "valmodes". The below cases apply for strings:

The short format is the format where an O is stored in RDF_QUAD.

The long value is an rdf_box object, that consists of six fields:

The SQL value is the string as a narrow string representing the UTF8 encoding of the value, stripped of data type and language tag.

The SQL form of an IRI is the string. The long and short forms are the IRI_ID referencing RU_IRI_ID of RDF_URL.

For all non-string, non-IRI types, the short, long and SQL values are the same SQL scalar of the appropriate native SQL type. A SQL host variable meant to receive an O should be of the ANY type.

The SPARQL implementation will usually translate results to the SQL format before returning them. Internally, it uses the shortest possible form suited to the operation. For equalities and joining, the short form is always good. For range comparisons, the long form is needed etc. For arithmetic, all three forms will do since the arguments are expected to be numbers which are stored as their binary selves in O, thus the O column unaltered and uncast will do as an argument of arithmetic or numeric comparison with, say, SQL literal constsnt.


14.1.4. Special Cases and XML Schema Compatibility

We note that since we store numbers as the equivalent SQL binary type, we do not preserve the distinction of byte, boolean etc. These all become integer. If preserving such detail is for some reason important, then storage as a typed string is possible but is not done at present for reasons of compactness and performance.


14.1.5. SQL Compiler Support - QUIETCAST option

The type cast behaviors of SQL and SPARQL are different. SQL will generally signal an error when an automatic cast fails. For example, a string can be compared to a date column if the string can be parsed as a date but otherwise the comparison should signal an error. In SPARQL, such situations are supposed to silently fail. Generally, SPARQL is much more relaxed with respect to data types.

These differences will be specially noticed if actual SQL data is processed with SPARQL via some sort of schema mapping translating references to triples into native tables and columns.

Also, even when dealing with the triple-oriented RDF_QUAD table, there are cases of joining between S and O such that the O can be a heterogenous set of IRI's and other data whereas the S is always an IRI. The non-IRI to IRI comparison should not give cast errors but should silently fail. Also, in order to keep queries simple and easily optimizable, it should not be necessary to introduce extra predicates for testing if the O is n IRI before comparing with the S.

Due to these considerations, Virtuoso introduces a SQL statement option called QUIETCAST. When given in the OPTION clause of a SELECT, it switches to silent fail mode for automatic type casting.

The syntax is as follows:

select ... from .... option (QUIETCAST)

This option is automatically added by the SPARQL to SQL translator. The scope is the enclosing procedure body.