
6.6. Performance Tuning

6.6.1. I/O

6.6.1.1. Optimizing Disk I/O

Virtuoso allows splitting a database over several files that may be on different devices. By allocating database fragments onto independent disks, I/O performance in both random and sequential database operations can be greatly enhanced.

The basic unit of a database is the segment. A segment consists of an integer number of 8K pages. A segment may consist of one or more files called stripes. If a segment has multiple stripes these will be of identical length and the segment size will be an integer multiple of the stripe size.

The size limit on individual database files is platform dependent, but 64 bit file offsets are used where available. For large databases, use of multiple disks and segments is recommended for reasons of parallelism, even though a single database file can get very large. A database can in principle grow up to 32TB (32-bit page number with 8KB per page).

When a segment is striped each logically consecutive page resides in a different file, thus for a segment of 2 stripes the first stripe will contain all even numbered pages and the second all the odd numbered pages. The stripes of a segment should always be located on independent disks.

In serving multiple clients that do random access to tables the server can perform multiple disk operations in parallel, taking advantage of the independent devices. Striping guarantees a statistically uniform access frequency to all devices holding stripes of a segment.

The random access advantages of striping are available without any specific configuring besides that of the stripes themselves.
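
For example, a basic striped segment is declared in the virtuoso.ini file as follows; the paths and size here are illustrative only:

[Striping]
Segment1 = 200M, /disk1/db-seg1-1.db, /disk2/db-seg1-2.db

This declares a 200 MB segment of two stripes. Logically consecutive pages alternate between the two files, so the two disks are hit with statistically equal frequency.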


6.6.1.2. Configuring I/O queues

Striping is also useful for a single client doing long sequential read operations. The server can detect the serial nature of an operation, for example a count of all rows in a table, and can intelligently prefetch rows.

If the table is located in a striped segment then the server will read all relevant disks in parallel if these disks are allocated to different I/O queues.

All stripes of different segments on one device should form an I/O queue. The idea is that the database stripes that benefit from being read sequentially form a separate queue. All queues are then read and written independently, each on its own thread; one thread is allocated per queue. If no queues are declared, all database files, even if located on different disks, share one queue.

A queue is declared in the striping section by specifying an I/O queue identifier after the path of a file:

[Striping]
Segment1 = 200M, /disk1/db-seg1-1.db iq1, /disk2/db-seg1-2.db iq2
Segment2 = 200M, /disk1/db-seg2-1.db iq1, /disk2/db-seg2-2.db iq2

In the above example the first stripes of the segments form one queue and the second stripes form another. This makes sense because now all database files on /disk1 are in iq1 and all those on /disk2 are in iq2.

This configuration could have resulted from originally planning a 200 MB database split on 2 disks and subsequently expanding that by another 200 MB.

The I/O queue identifier can be an arbitrary string. As many background I/O threads will be made as there are distinct I/O queues.

Striping and using I/O queues can multiply sequential read rates by a factor almost equal to the number of independent disks. On the other hand, assigning stripes of one disk to different queues can have a very detrimental effect. The rule is that everything that is physically accessed sequentially should be on the same queue.



6.6.2. Schema Design Considerations

6.6.2.1. Data Organization

One should keep the following in mind when designing a schema for maximum efficiency.


6.6.2.2. Index Usage

A select from a table using a non-primary key will need to retrieve the main row if there are search criteria on columns that appear only on the main row, or output columns that have to be fetched from it. Operations are noticeably faster if they can be completed without fetching the main row when the driving key is a non-primary key. This is the case when all search criteria and output columns are parts of the secondary key or of the primary key; note that all secondary keys contain a copy of the primary key. For this purpose it may be useful to add trailing key parts to a secondary key. Indeed, a secondary key can hold all the columns of a row as trailing key parts. This slows insert and update but makes reference to the main row unnecessary when selecting through the secondary key.

A sequential read on the primary key is always fastest. A sequential search with few hits can be faster on a secondary key if the criterion can be evaluated without reference to the main row. This is because a short key can have more entries per page.
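
As an illustration of a covering secondary key, consider the following sketch. The table and column names are hypothetical; the point is that the select below completes from the secondary key alone, never touching the main row:

create table ORDERS (
  O_ID integer primary key,   -- copied into every secondary key
  O_CUST integer,
  O_STATE varchar,
  O_TOTAL numeric
);

-- O_STATE is a trailing key part: inserts and updates get slower,
-- but the select below completes without fetching the main row.
create index ORDERS_CUST on ORDERS (O_CUST, O_STATE);

-- Search criterion and outputs are all on the secondary key:
-- O_CUST, O_STATE and the copied primary key O_ID.
select O_ID, O_STATE from ORDERS where O_CUST = 42;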


6.6.2.3. Space Consumption

Each column takes the space 'naturally' required by its value. No field lengths are preallocated. Space consumption for columns is the following:

Table: 6.6.2.3.1. Data Type Space Consumption

  Data type            Bytes
  -------------------  ------------------------------------------------
  Integer below 128    1
  Smallint             2
  long                 4
  float                4
  timestamp            10
  double               8
  string               2 + number of characters
  NULL                 the data length for a fixed length column;
                       a value of 0 length for a variable length column
  BLOB                 88 on row + n x 8K (see note below)

If a BLOB fits in the remaining free bytes on a row after non-LOBs are stored, it is stored inline and consumes only 3 bytes + BLOB length.

Each index entry has an overhead of 4 bytes. This applies to the primary key as well as any other keys. The length of the concatenation of the key parts is added to this. For the primary key the length of all columns are summed. For any other key the lengths of the key parts plus any primary key parts not on the secondary key are summed. The maximum length of a row is 4076 bytes.
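
For example, take a hypothetical table with a long primary key column and a 10-character string column. A primary key entry costs 4 (entry overhead) + 4 (long) + 12 (2 + 10 for the string) = 20 bytes. A secondary key on the string column also costs 4 (overhead) + 12 (string key part) + 4 (copied primary key) = 20 bytes per row.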

In light of these points primary keys should generally be short.


6.6.2.4. Page Allocation

For data inserted in random order, pages tend to be 3/4 full. For data inserted in ascending order, pages will be about 90% full, because the split point of a page is chosen differently when the insert history is ascending.



6.6.3. Efficient Use of SQL - SQL Execution profiling

Virtuoso offers an execution profiling mechanism that keeps track of the relative time consumption and response times of different SQL statements.

Profiling can be turned on or off with the prof_enable function. When profiling is on, the real time between the start and end of each SQL statement execute call is logged on the server. When prof_enable is called for the second time, the statistics gathered between the previous call to prof_enable and this call are dumped to the virtprof.out file in the server's working directory.

Profiling is off by default. Profiling can be turned on with the statement:

prof_enable (1);

The virtprof.out file will be generated when prof_enable is called for the second time, e.g.

prof_enable (0);

will write the file and turn profiling back off.
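
A complete measurement cycle thus looks as follows; the workload between the two calls is whatever the application runs during the interval:

prof_enable (1);   -- start the measurement interval
-- ... run the workload to be profiled ...
prof_enable (0);   -- write virtprof.out and turn profiling off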

Below is a sample profile output file:

Query Profile (msec)
Real 183685, client wait 2099294, avg conc 11.428772 n_execs 26148 avg exec  80

99 % under 1 s
99 % under 2 s
99 % under 5 s
100 % under 10 s
100 % under 30 s

23 stmts compiled 26 msec, 99 % prepared reused.

 %  total n-times n-errors
49 % 1041791  7952     53     new_order
28 % 602789   8374     490   delivery_1
12 % 259833   8203     296   payment
5  % 123162   821      35    slevel
2  % 54182    785      0     ostat (?, ?, ?, ?)
0  % 11614    4        0     checkpoint
0  % 2790     2        1     select no_d_id, count (*) from new_orde
0  % 2457     3        1     select count (*) from new_order
0  % 662      2        0     status ()
0  % 11       1        1     set autocommit on
0  % 3        1        0     select * from district

This file was produced by profiling the TPC C sample application for 3 minutes. The numbers have the following meanings:

The real time is the length of the measurement interval, that is, the time between the prof_enable call that started the profiling and the call that wrote the report. The client wait time is the time cumulatively spent inside the execute call server side; only calls completely processed between profiling start and end are counted. The average concurrency is the cumulative execute time divided by the real time and indicates how many requests were on average concurrently pending during the measurement interval. The count of executes and their average duration are also shown.
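
These figures can be checked against each other: in the sample above, 2099294 ms of cumulative execute time divided by 183685 ms of real time gives the average concurrency of about 11.43, and 2099294 divided by the 26148 executes gives the average execution time of about 80 ms.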

The next section shows the percentage of executes that finished under 1, 2, 5, 10 and 30 seconds of real time.

The compilation section indicates how many statements were compiled during the interval. These will be SQLPrepare calls or SQLExecDirect calls where the text of the statement was not previously known to the server. The real time spent compiling the statements is added up. The percentage of prepared statement reuses, that is, of executes not involving compiling, over all executes is shown. This is an important performance metric, considering that it is always better to use prepared statements with parameters than statements with the parameters as literals.

The next section shows individual statements executed during the measurement interval sorted by descending cumulative real time between start and completion of the execute call.

The table shows the percentage of real time spent during the calls to the statement as a percentage of the total real time spent in all execute calls. Note that these real times can be higher than the measurement interval since real times on all threads are added up.

The second column shows the total execute real time, the third column shows the count of executes of the statement during the measurement interval. The fourth column is the count of executes that resulted in an error. The error count can be used for instance to spot statements that often produce deadlocks.

Statements are distinguished for profiling purposes by the first 40 characters of their text. Two distinct statements that do not differ in their first 40 characters are considered the same for profiling. If a statement is a procedure call, only the name of the procedure is considered, not any differing literal parameters.

The profiler automatically writes the report once 10000 distinct statements have accumulated in the measurement interval. This caps the profiler's memory consumption for applications that always compile a new statement with different literals, which would otherwise produce an arbitrarily long list of statements, each executed once. Such a profile would not be very useful in any case.

It cannot be overemphasized that if an application does any sort of repetitive processing then this should be done with prepared statements and parameters, for both reasons of performance and profilability.
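
From server-side Virtuoso/PL, one way to keep a single statement text across executions is exec () with a parameter vector, sketched below; the table and column names are hypothetical:

declare state, msg varchar;
-- One statement text, many parameter values: all executions are
-- counted under the same line of the profile.
exec ('select count (*) from DB.DBA.ORDERS where O_ID = ?',
      state, msg, vector (42));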

Note:

Measurements are made with one millisecond precision. Percentages are rounded to two digits. Timing of fast operations, under a few milliseconds, will be imprecise as a result of the 1 ms resolution. The cumulative compilation time may likewise be substantially off, since a single compilation often takes less than 1 ms. The precision may also vary between platforms.


6.6.4. Meters & System Views

6.6.4.1. DB.DBA.SYS_K_STAT, DB.DBA.SYS_D_STAT, DB.DBA.SYS_L_STAT views

These views provide statistics on the database engine:

create view SYS_K_STAT as
  select KEY_TABLE, name_part (KEY_NAME, 2) as index_name,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'n_landings') as landed,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'total_last_page_hits') as consec,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'page_end_inserts') as right_edge,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_escalations') as lock_esc
	from SYS_KEYS;
create view SYS_L_STAT as
  select KEY_TABLE, name_part (KEY_NAME, 2) as index_name,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_set') as locks,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_waits') as waits,
	(key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_waits') * 100)
	  / (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_set') + 1) as wait_pct,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'deadlocks') as deadlocks,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_escalations') as lock_esc
	from SYS_KEYS;
create view sys_d_stat as
  select KEY_TABLE, name_part (KEY_NAME, 2) as index_name,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'touches') as touches,
	key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'reads') as reads,
	(key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'reads') * 100)
	  / (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'touches') + 1) as read_pct
	from SYS_KEYS;

These views offer detailed statistics on index access locality, lock contention and disk usage.


6.6.4.2. SYS_K_STAT - Key statistics


6.6.4.3. SYS_L_STAT


6.6.4.4. SYS_D_STAT

Examples:
select index_name, locks, waits, wait_pct, deadlocks
    from sys_l_stat order by 2 desc;

Get lock data, indices in descending order of lock count.

select index_name, touches, reads, read_pct
    from sys_d_stat order by 3 desc;

Get disk read counts, index with most reads first.

select index_name, (consec * 100) / (landed + 1)
    from sys_k_stat where landed > 1000  order by 2;

Get the percentage of consecutive page access on indices with over 1000 accesses so far, most randomly accessed first.


6.6.4.5. status SQL function - status ();

This function returns a summary of the database status as a result set. The result set has one varchar column, which has consecutive lines of text. The lines can be up to several hundred characters.

The contents of the status summary are described in the Administrator's Guide.


6.6.4.6. Virtuoso db file usage detailed info

All data in a Virtuoso database are logically stored as rows of database keys. Thus the primary key for a table holds the entire row (including the dependent part) and the secondary keys just hold their respective key parts. So the space that a table occupies is the sum of the space occupied by its primary key and all its secondary keys.

The main physical unit of allocation in a Virtuoso db file is the database page (about 8k in Virtuoso 3.x). So the server needs to map the key rows and the outline (non-inlined) blobs onto database pages.

Virtuoso will store as many rows in a db page as it can, so usually one db page will contain more than one row of a given key. No page contains rows of more than one key. Blobs, however (when not inlined on the row), are placed in as many consecutive db pages as their size requires. In addition to the blob and key pages, the Virtuoso db holds a number of pages containing internal data. So the sum of the pages occupied by the key rows and the blobs is less than the total count of occupied pages (as reported by the status () BIF).

To provide detailed information about the space consumption of each key, there is a system view:

DB.DBA.SYS_INDEX_SPACE_STATS
    ISS_KEY_TABLE       varchar -- name of the table
    ISS_KEY_NAME        varchar -- name of the key
    ISS_KEY_ID          integer -- id of the key (corresponding to KEY_ID from DB.DBA.SYS_KEYS)
    ISS_NROWS           integer -- number of rows in the table
    ISS_ROW_BYTES       integer -- sum of the byte lengths of all the rows in the table
    ISS_BLOB_PAGES      integer -- sum of the blob pages occupied by the outline blobs on all the rows of the table
    ISS_ROW_PAGES       integer -- sum of all the db pages containing rows of this key
    ISS_PAGES           integer -- = ISS_BLOB_PAGES + ISS_ROW_PAGES (for convenience).

Each select on this view causes the server to go over all the pages in the db file (similarly to how a crash dump operates) and collect the statistics above. The pages are traversed once per select, which on large database files may take some time.
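
For example, to list the most space-consuming keys first:

select ISS_KEY_TABLE, ISS_KEY_NAME, ISS_ROW_PAGES, ISS_BLOB_PAGES, ISS_PAGES
  from DB.DBA.SYS_INDEX_SPACE_STATS
  order by ISS_PAGES desc;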



6.6.5. Transaction Metrics, Diagnostics and Optimization

This section covers transaction isolation levels in Virtuoso and the metrics, diagnostics and optimization of transactions.

Bad design and implementation of transactions affects applications in two ways:

* Transactions slow each other down and, at worst, are repeatedly aborted by deadlock.
* The server must keep data structures for too many simultaneous locks, rollback records and uncommitted roll forward logs, consuming memory.

The following rules should be observed when writing transactions:

* Make transactions short. A report that reads the data once can always be done as read committed instead of repeatable read without affecting semantics. Even if some data is read multiple times, the repeatable read semantic is typically not relevant for reports.
* If a transaction will update a resource it has first read, read it with exclusive locks. See the for update clause in select, for example.
* If several resources are locked in one transaction, lock them always in the same order. When updating stock for an order, update the quantity on hand in increasing order of item number, for instance.
* Update the detail before updating the summary. Update the quantity in stock for the ordered items before updating the orders count of the whole warehouse.
* End the transaction as soon as possible with a commit work statement.
* Make sure that if a transaction is deadlocked, the deadlocked transaction gets retried. For example, have a "declare exit handler for sqlstate 40001" for every transaction context. Make sure that a deadlocking transaction is never retried endlessly. Two mutually deadlocking transactions can keep retrying and again deadlocking each other endlessly. To avoid this, have a maximum count of retries and a random delay before restarting. The restart delay should be between 0 and the expected duration of the transaction.
* When inserting, updating or deleting in bulk, commit every few thousand or tens of thousands of rows per transaction, never more than that. Failing to do this makes for prohibitive cost of retry with deadlocks and can cause swapping by keeping tens or hundreds of megabytes in rollback state, locks and other transaction temporary structures. This happens if one inserts, updates or deletes several million rows in a single transaction. If this is really needed and concurrency is no issue, use the atomic mode, effectively making the server single user for the transaction, thus having no locking or rollback. See __atomic ().

Programming Virtuoso/PL

The isolation level is set in Virtuoso/PL with the
set isolation := level;

statement, where level is one of 'serializable', 'repeatable', 'committed', 'uncommitted'. Example:

set isolation = 'serializable';

The standard SQL syntax is also supported:

SET TRANSACTION ISOLATION LEVEL <isolation level>
<isolation level> ::= READ UNCOMMITTED | READ COMMITTED | REPEATABLE READ | SERIALIZABLE

Example:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED

The effect lasts for the rest of the procedure and for any procedure called from it. The effect ends when the procedure that executed the set isolation statement returns.
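
For example, the following sketch (the procedure and its body are hypothetical) runs a report as read committed; the setting also covers any procedures it calls and ends when it returns:

create procedure ORDER_REPORT ()
{
  set isolation = 'committed';  -- reports rarely need repeatable read
  -- ... long report queries here ...
}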

Sample Deadlock Handler

The text for a deadlock handler is:

  declare retry_count int;
  retry_count := 0;
retry:
  {
    declare exit handler for sqlstate '40001'
      {
        rollback work;
        ...
        -- wait between 0 and 2.5 seconds, if 2.5 seconds is the
        -- expected duration of the transaction
        delay (rnd (2.5));
        ...

        retry_count := retry_count + 1;
        if (retry_count > 5)
          signal ('xxxxx', 'Too many deadlock retries in xxxxx.');
        goto retry;
      }
    -- do the operations.  The actual working code here.

    commit work;
  }

An exclusive read is done with:

select s_quantity from stock where s_i_id = 111 for update;

ODBC

For the Virtuoso ODBC driver the isolation is set by:

   rc = SQLSetConnectOption (hdbc, SQL_TXN_ISOLATION, SQL_TXN_READ_COMMITTED);

or

   rc = SQLSetConnectAttr (hdbc, SQL_ATTR_TXN_ISOLATION,
                           (SQLPOINTER) SQL_TXN_READ_COMMITTED, 0);

Constants are: SQL_TXN_READ_UNCOMMITTED, SQL_TXN_READ_COMMITTED, SQL_TXN_REPEATABLE_READ, SQL_TXN_SERIALIZABLE.

The isolation level can also be preset in the connection string:

   SQLDriverConnect (hdbc, hwnd,
       "DSN=MyDSN;IsolationLevel=Repeatable Read;UID=dba;PWD=dbapwd", SQL_NTS,
       NULL, 0,
       NULL,
       SQL_DRIVER_NOPROMPT);

The possible options for the connection string are: "Read Uncommitted", "Read Committed", "Repeatable Read" and "Serializable".

JDBC

In the Virtuoso JDBC driver the isolation is set by the java.sql.Connection.setTransactionIsolation() JDBC API.

  conn.setTransactionIsolation (java.sql.Connection.TRANSACTION_SERIALIZABLE);

The constants are described in the documentation of java.sql.Connection.

.NET

In the VirtuosoClient.NET provider the isolation is set by the System.Data.IDbConnection.BeginTransaction (IsolationLevel) method.

  System.Data.IDbTransaction trx = conn.BeginTransaction (System.Data.IsolationLevel.ReadCommitted);

The constants are described in the documentation of System.Data.IsolationLevel.

Metrics and Diagnostics

Metrics are presented at the server and the table level.

The first metric to check is the output of status ('');

The paragraph titled transaction status contains the following:

* The count of deadlocks since startup, and the number of 2r1w deadlocks. The latter is a special case where two transactions both hold a shared lock on a resource and one tries to convert the lock to exclusive. This situation can kill the transaction attempting to write. Such deadlocks are essentially always needless. They are avoided by reading for update when first reading the resource.
* The count of lock waits, incremented every time some operation waits for a lock, except if this wait leads into a deadlock. If the number of deadlocks is high, let's say over 5% of the number of waits, transactions are likely badly designed and deadlock too often, either because of not locking for write at the get go (2r1w) or because of locking resources in varying order.
* The count of threads that are somehow occupied, whether running or waiting. This count, minus the count of threads waiting, minus the count of threads in vdb, is the count of threads that in principle could be on CPU.
* The count of threads waiting for a lock. If this is a high percentage of the count of threads running, say over 30%, resources are likely locked inefficiently, keeping too many locked.
* The count of threads in vdb, i.e. threads spending time waiting for I/O either from a remote database or any sort of network operation, including access to web services on other servers, access to web pages on other hosts, etc.

The system view db.dba.sys_l_stat is used for locating bottlenecks.

The columns are:

* table - The table in question.
* index - The index through which the rows were accessed. If a row is accessed through a secondary key, the lock is set on the index first, only then on the pk, that is, if the pk is accessed at all. For all updates, however, the pk will always be accessed.
* locks - The count of times a lock was set on this index.
* waits - The count of times an operation waited for a lock on this index. There can be more waits than locks because a read committed cursor can wait but will not lock, thus not all waits result in locks.
* wait_pct - The percentage of lock set attempts that waited, i.e. (waits * 100) / (locks + 1).
* deadlocks - The count of deadlocks detected when attempting to wait for a lock on this index.
* lock_esc - The count of lock escalations, where many row locks were replaced by a single page lock on this index.
* wait_msecs - The cumulative time spent waiting for a lock on this index. This may be greater than elapsed time because many threads can wait at the same time.

All counts and times are cumulative from server startup.

The interpretation is as follows:

If deadlocks is high in relation to waits or locks, i.e. over 5%, there are many deadlocks and the transaction profiles may have to be adjusted. The table where deadlocks is incremented is the table where the deadlock was detected, but the deadlock may involve any number of tables. So, if A and B are locked in the order A, B half of the time and B, A the rest of the time, the deadlock counts of the tables of A and B will be about the same, half of the deadlocks being detected when locking A, the other half when locking B.

If waits is high in relation to locks, for example 20%, then there is probably needless contention. Things are kept locked needlessly. Use read committed, or make shorter transactions, or lock the items with less contention first.

Because transaction duration varies, the place with the highest count of waits is not necessarily the place with the heaviest contention if the waits are short. Use wait_msecs in addition to waits for determining where the waiting takes place.

To get a general picture, use the Conductor's Statistics page or simply do

select top 5 * from sys_l_stat order by wait_msecs desc;

to get a quick view of where time is spent. You can also sort by waits desc or locks desc.

SQL Issues

It is possible to get bad locking behavior if the SQL compiler decides to make linear scans of tables or indices and the isolation is greater than read committed. The presence of a linear scan on an index with locking is often indicated by having a large number of lock escalations. If lock_esc is near locks then a large part of the activity is likely sequential reads.

The general remedy is to do any long report type transactions as read committed unless there are necessary reasons to do otherwise.

To see how a specific query is compiled, one can use the explain () function. To change how a query is compiled, one can use the table option or option SQL clauses.
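
For example, with any valid query text:

explain ('select count (*) from DB.DBA.SYS_USERS');

returns the compiled plan as text, from which one can see whether the compiler chose an index lookup or a linear scan.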

Observation of Dynamics

Deadlocks and contention do not occur uniformly across time. The occurrences will sharply increase after a certain application dependent load threshold is exceeded.

Also, deadlocks will occur in batches. Several transactions will first wait for each other and then retry maybe several times, maybe only one succeeding at every round. In such worst cases, there will be many more deadlocks than successfully completed transactions. Optimize locking order and make the transactions smaller.

Looking at how the counts change, especially whether they change in bursts, is useful.

Tracing and Debugging

The most detailed picture of a system's behavior, including deadlocks and other exceptions, can be obtained with profiling. If the application is written in C, Java or some other compiled language, one can use the language's test coverage facility to see execution counts for the various branches of the code.

For client applications, using the Virtuoso trace () function can be useful for seeing which statements signal errors. Also the profiling report produced by prof_enable () can give useful hints on execution times and error frequencies. See Profiling and prof_enable ().

For PL applications, Virtuoso provides a profiler and test coverage facility. It is activated by setting PLDebug = 2 in the Parameters section of the ini file and starting the server. The functions cov_store () and cov_report () are used for obtaining a report of the code execution profile thus far; see the documentation on "Branch Coverage" for this. The execution counts of the lines of exception handler code show the frequency of exceptions. If, in a linear block of code, a specific line has an execution count lower than that of the line above it, the line with the higher count has signalled as many exceptions as the difference of the two counts.
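
A session might look as follows. This is a sketch only: it assumes cov_store () and cov_report () are called with no arguments after the workload has run; check the Branch Coverage documentation for the exact signatures.

-- server started with PLDebug = 2 in the [Parameters] section
-- ... run the PL workload to be analyzed, then:
cov_store ();    -- store the execution counts collected so far
cov_report ();   -- report execution counts per source line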

The times indicated in the flat and call graph profile reports for PL procedures are totals across all threads and include time spent waiting for locks. Thus a procedure that consumes almost no CPU can appear high on the list if it waits for locks, especially if this happens on multiple threads concurrently. The times are real times measured separately on each thread and can thus exceed any elapsed real time.

Single stepping is not generally useful for debugging locking since locking issues are timing sensitive.