Virtuoso allows splitting a database over several files that may be on different devices. By allocating database fragments onto independent disks I/O performance in both random and sequential database operations can be greatly enhanced.
The basic unit of a database is the segment. A segment consists of an integer number of 8K pages. A segment may consist of one or more files called stripes. If a segment has multiple stripes these will be of identical length and the segment size will be an integer multiple of the stripe size.
The size limit on individual database files is platform dependent, but 64 bit file offsets are used where available. For large databases the use of multiple disks and segments is recommended for reasons of parallelism, even though a single database file can get very large. A database can in principle grow up to 32TB: a 32-bit page number with 8KB per page gives 2^32 x 8KB = 32TB.
When a segment is striped each logically consecutive page resides in a different file, thus for a segment of 2 stripes the first stripe will contain all even numbered pages and the second all the odd numbered pages. The stripes of a segment should always be located on independent disks.
In serving multiple clients that do random access to tables the server can perform multiple disk operations in parallel, taking advantage of the independent devices. Striping guarantees a statistically uniform access frequency to all devices holding stripes of a segment.
The random access advantages of striping are available without any specific configuring besides that of the stripes themselves.
Striping is also useful for a single client doing long sequential read operations. The server can detect the serial nature of an operation, for example a count of all rows in a table, and can intelligently prefetch rows.
If the table is located in a striped segment then the server will read all relevant disks in parallel if these disks are allocated to different I/O queues.
All stripes located on one device, even if they belong to different segments, should form one I/O queue. The idea is that the database stripes that benefit from being read sequentially form a separate queue. All queues are then read and written independently, each on a different thread; one thread is allocated per queue. If no queues are declared, all database files share one queue, even if located on different disks.
A queue is declared in the striping section by specifying an I/O queue identifier after the path of the file.
[Striping]
Segment1 = 200M, disk1/db-seg1-1.db = iq1, disk2/db-seg1-2.db = iq2
Segment2 = 200M, disk1/db-seg2-1.db = iq1, disk2/db-seg2-2.db = iq2
In the above example the first stripes of the two segments form one queue and the second stripes form another. This makes sense because all database files on disk1 are in iq1 and all those on disk2 are in iq2.
This configuration could have resulted from originally planning a 200 MB database split on 2 disks and subsequently expanding that by another 200 MB.
The I/O queue identifier can be an arbitrary string. As many background I/O threads will be made as there are distinct I/O queues.
Striping and using I/O queues can multiply sequential read rates by a factor almost equal to the number of independent disks. On the other hand, assigning stripes on one disk to different queues can have a very detrimental effect. The rule is that everything that is physically accessed sequentially should be on the same queue.
One should keep the following in mind when designing a schema for maximum efficiency.
A select from a table using a non-primary key will need to retrieve the main row if there are search criteria on columns appearing on the main row or output columns that have to be fetched from the main row. When the driving key is a non-primary key, operations are noticeably faster if they can be completed without fetching the main row. This is the case when all search criteria and output columns are among the secondary key parts or primary key parts. Note that all secondary keys contain a copy of the primary key. For this purpose it may be useful to add trailing key parts to a secondary key, as the example below shows. Indeed, a secondary key can hold all the columns of a row as trailing key parts. This slows insert and update but makes reference to the main row unnecessary when selecting using the secondary key.
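As an illustration (the ORDERS table, its columns and the index name below are hypothetical, not from the source), the index carries O_TOTAL as a trailing key part so that a select on O_DATE returning O_TOTAL is answered entirely from the secondary key:

create table ORDERS (
  O_ID integer primary key,
  O_DATE date,
  O_TOTAL numeric,
  O_NOTES long varchar);

-- O_TOTAL is a trailing key part: the select below is resolved
-- inside the index, without fetching the main row
create index ORDERS_DATE on ORDERS (O_DATE, O_TOTAL);

select O_TOTAL from ORDERS where O_DATE = cast ('2005-01-01' as date);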
A sequential read on the primary key is always fastest. A sequential search with few hits can be faster on a secondary key if the criterion can be evaluated without reference to the main row. This is because a short key can have more entries per page.
Each column takes the space 'naturally' required by its value. No field lengths are preallocated. Space consumption for columns is the following:
Data | Bytes |
---|---|
Integer below 128 | 1 |
Smallint | 2 |
long | 4 |
float | 4 |
timestamp | 10 |
double | 8 |
string | 2 + characters |
NULL | data length for a fixed length column; as a value of 0 length for a variable length column |
BLOB | 88 on row + n x 8K (see note below) |
If a BLOB fits in the remaining free bytes on a row after non-LOBs are stored, it is stored inline and consumes only 3 bytes + BLOB length.
Each index entry has an overhead of 4 bytes. This applies to the primary key as well as any other keys. The length of the concatenation of the key parts is added to this. For the primary key the length of all columns are summed. For any other key the lengths of the key parts plus any primary key parts not on the secondary key are summed. The maximum length of a row is 4076 bytes.
In light of these points primary keys should generally be short.
For data inserted in random order pages tend to be 3/4 full. For data inserted in ascending order pages will be about 90% full due to a different splitting point for a history of rising inserts.
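As a worked example of these rules (the table and figures are illustrative only): consider a row consisting of an integer key with values below 128 and a 20-character string column.

index entry overhead               4 bytes
integer below 128                  1 byte
20-character string           2 + 20 bytes
                              -----------
total per row                     27 bytes

rows per 8K page at 3/4 fill:  8192 * 0.75 / 27, about 227
pages for 1,000,000 rows:      1,000,000 / 227, about 4,400 pages, roughly 35 MB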
Virtuoso offers an execution profiling mechanism that keeps track of the relative time consumption and response times of different SQL statements.
Profiling can be turned on or off with the prof_enable function. When profiling is on, the real time between the start and end of each SQL statement execute call is logged on the server. When prof_enable is called for the second time the statistics gathered between the last call to prof_enable and this call are dumped to the virtprof.out file in the server's working directory.
Profiling is off by default. Profiling can be turned on with the statement:
prof_enable (1);
The virtprof.out file will be generated when prof_enable is called for the second time, e.g.
prof_enable (0);
will write the file and turn profiling back off.
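A minimal measurement cycle thus looks as follows:

prof_enable (1);   -- start profiling
-- ... run the workload to be measured ...
prof_enable (0);   -- write virtprof.out and turn profiling off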
Below is a sample profile output file:
Query Profile (msec)
Real 183685, client wait 2099294, avg conc 11.428772 n_execs 26148 avg exec 80

 99 % under 1 s
 99 % under 2 s
 99 % under 5 s
100 % under 10 s
100 % under 30 s

23 stmts compiled 26 msec, 99 % prepared reused.

% total    n-times  n-errors
49 %   1041791    7952     53   new_order
28 %    602789    8374    490   delivery_1
12 %    259833    8203    296   payment
 5 %    123162     821     35   slevel
 2 %     54182     785      0   ostat (?, ?, ?, ?)
 0 %     11614       4      0   checkpoint
 0 %      2790       2      1   select no_d_id, count (*) from new_orde
 0 %      2457       3      1   select count (*) from new_order
 0 %       662       2      0   status ()
 0 %        11       1      1   set autocommit on
 0 %         3       1      0   select * from district
This file was produced by profiling the TPC C sample application for 3 minutes. The numbers have the following meanings:
The real time is the real time interval of the measurement, that is, the time between the prof_enable call that started the profiling and the call that wrote the report. The client wait time is the time cumulatively spent inside the execute call server side; only calls completely processed between profiling start and end are counted. The average concurrency is the cumulative execute time divided by the real time; it indicates how many requests were on average concurrently pending during the measurement interval. The count of executes and their average duration are also shown.
The next section shows the percentage of executes that finished under 1, 2, 5, 10 and 30 seconds of real time.
The compilation section indicates how many statements were compiled during the interval. These will be SQLPrepare calls or SQLExecDirect calls where the text of the statement was not previously known to the server. The real time spent compiling the statements is added up. The percentage of prepared statement reuses, that is, executes not involving compiling, over all executes is shown. This is an important performance metric, considering that it is always better to use prepared statements with parameters than to recompile statements with the parameters written in as literals.
The next section shows individual statements executed during the measurement interval sorted by descending cumulative real time between start and completion of the execute call.
The table shows the percentage of real time spent during the calls to the statement as a percentage of the total real time spent in all execute calls. Note that these real times can be higher than the measurement interval since real times on all threads are added up.
The second column shows the total execute real time, the third column shows the count of executes of the statement during the measurement interval. The fourth column is the count of executes that resulted in an error. The error count can be used for instance to spot statements that often produce deadlocks.
Statements are distinguished for profiling purposes using the 40 first characters of their text. Two distinct statements that do not differ in their first 40 characters will be considered the same for profiling. If a statement is a procedure call then only the name of the procedure will be considered, not possibly differing literal parameters.
The profiler will automatically write the report after 10000 distinct statements have accumulated in the measurement interval. This puts a ceiling on the profiler's memory consumption for applications that always compile a new statement with different literals, which would otherwise result in an arbitrarily long list of statements each executed once. Such a profile would in any case not be very useful.
It cannot be overemphasized that if an application does any sort of repetitive processing then this should be done with prepared statements and parameters, for both reasons of performance and profilability.
Note that measurements are made with one millisecond precision. Percentages are rounded to 2 digits. Timing of fast operations, under a few milliseconds, will be imprecise as a result of the 1 ms resolution. The cumulative compilation time may likewise be substantially off, since a single compilation often takes less than 1 ms. The precision may also vary between platforms.
These views provide statistics on the database engine:
create view SYS_K_STAT as
  select KEY_TABLE, name_part (KEY_NAME, 2) as index_name,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'n_landings') as landed,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'total_last_page_hits') as consec,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'page_end_inserts') as right_edge,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_escalations') as lock_esc
  from SYS_KEYS;
create view SYS_L_STAT as
  select KEY_TABLE, name_part (KEY_NAME, 2) as index_name,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_set') as locks,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_waits') as waits,
    (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_waits') * 100)
      / (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_set') + 1) as wait_pct,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'deadlocks') as deadlocks,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_escalations') as lock_esc
  from SYS_KEYS;
create view sys_d_stat as
  select KEY_TABLE, name_part (KEY_NAME, 2) as index_name,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'touches') as touches,
    key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'reads') as reads,
    (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'reads') * 100)
      / (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'touches') + 1) as read_pct
  from SYS_KEYS;
These views offer detailed statistics on index access locality, lock contention and disk usage.
KEY_TABLE The fully qualified table name, e.g. DB.DBA.SYS_PROCEDURES
INDEX_NAME The name of the index. This will be equal to the table name for the table's primary key.
LANDED The count of random accesses, including inserts. Any insert or select, whether empty, single line or multi-line, counts as one. Updates and deletes do not count, as they imply a select in the same or a previous statement.
CONSEC The number of times a random access falls on the same page as the previous random access. This is always less than LANDED. For repetitive access to the same place or an ascending insert, this will be near LANDED. For a totally random access pattern this will be near 0.
RIGHT_EDGE The number of times an insert has added a row to the right edge of the page where the insert was made.
LOCK_ESC The count of lock escalations, see SYS_L_STAT.
KEY_TABLE The fully qualified table name, e.g. DB.DBA.SYS_PROCEDURES
INDEX_NAME The name of the index. This will be equal to the table name for the table's primary key.
LOCKS The number of times a lock has been set on the index. Making a new row or page lock counts as one. Entering a row or page lock either after a wait or without wait (for a shared lock) counts as one.
WAITS The number of times a cursor reading this index waited for a lock. Note that this can be higher than the number of locks set, e.g. a 'read committed' cursor may wait for a lock but will never make one.
WAIT_PCT The percentage of lock set events that involved a wait.
DEADLOCKS The number of times a deadlock was detected when trying to wait for a lock on this index. Note that one deadlock event may involve locks on several indices. Each deadlock detection counts as one.
LOCK_ESC The number of times the set of row locks on a page of this index were escalated into one page lock or a page lock was set initially. This is always less than LOCKS. This value will be near LOCKS when there are many sequential selects which switch to page lock mode. This happens when a cursor has performed over 2 lock escalations and the page being entered has no locks, i.e. the lock can be set over the entire page.
KEY_TABLE The fully qualified table name, e.g. DB.DBA.SYS_PROCEDURES
INDEX_NAME The name of the index. This will be equal to the table name for the table's primary key.
TOUCHES The number of times a row is located on the index. Every row retrieved by a select or inserted counts as one. All rows scanned by a select count (*) or other aggregate count as one.
READS The number of times a disk read was caused by a read operation on this index. This may theoretically be higher than TOUCHES, since several levels of the index tree may have to be read to get to a leaf.
READ_PCT The percentage of READS in TOUCHES.
-- Get lock data, indices in descending order of lock count.
select index_name, locks, waits, wait_pct, deadlocks
  from sys_l_stat order by 2 desc;

-- Get disk read counts, index with most reads first.
select index_name, touches, reads, read_pct
  from sys_d_stat order by 3 desc;

-- Get the percentage of consecutive page access on indices with over 1000
-- accesses so far, most randomly accessed first.
select index_name, (consec * 100) / (landed + 1)
  from sys_k_stat where landed > 1000 order by 2;
The status function returns a summary of the database status as a result set. The result set has one varchar column, which contains consecutive lines of text. The lines can be up to several hundred characters long.
The contents of the status summary are described in the Administrator's Guide.
All data in a Virtuoso database are logically stored as database key rows. Thus the primary key for a table holds the entire row (including the dependent part) and the secondary keys just hold their respective key parts. So the space that a table occupies is the sum of the space occupied by its primary key and all its secondary keys.
The main physical unit of allocation in a Virtuoso db file is the database page (8K in Virtuoso 3.x). So the server needs to map the key rows and outline blobs to database pages.
Virtuoso will store as many rows in a db page as it can, so usually one DB page will contain more than one row of a given key. No page contains rows from more than one key. However, blobs (when not inlined on the row) are placed in consecutive DB pages (up to their size). In addition to the blob and key pages the Virtuoso DB holds a number of pages containing internal data. So the sum of the pages occupied by the key rows and the blobs is less than the number of occupied pages (as reported by the status () BIF).
To provide detailed information about the space consumption of each key there's a system view:
DB.DBA.SYS_INDEX_SPACE_STATS

ISS_KEY_TABLE   varchar  -- name of the table
ISS_KEY_NAME    varchar  -- name of the key
ISS_KEY_ID      integer  -- id of the key (corresponding to KEY_ID from DB.DBA.SYS_KEYS)
ISS_NROWS       integer  -- number of rows in the table
ISS_ROW_BYTES   integer  -- sum of the byte lengths of all the rows in the table
ISS_BLOB_PAGES  integer  -- sum of the blob pages occupied by the outline blobs on all the rows of the table
ISS_ROW_PAGES   integer  -- sum of all the db pages containing rows of this key
ISS_PAGES       integer  -- = ISS_BLOB_PAGES + ISS_ROW_PAGES (for convenience)
Each select on that view causes the server to go over all the db pages in the db file (similarly to how the crash dump operates) and collect the statistics above. The pages are traversed once per select, which on large database files may take some time.
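For example, the following query (a straightforward use of the view described above) lists the keys taking the most space first:

select ISS_KEY_TABLE, ISS_KEY_NAME, ISS_NROWS, ISS_ROW_PAGES, ISS_BLOB_PAGES, ISS_PAGES
  from DB.DBA.SYS_INDEX_SPACE_STATS
  order by ISS_PAGES desc;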
Bad design and implementation of transactions affect applications in two ways: results can be incorrect if the isolation used is too weak, and performance can suffer from lock waits and deadlocks when transactions lock too much for too long.
The following rules should be observed when writing transactions: use the lowest isolation level that still gives correct results, keep transactions short, and lock the items with the most contention last (see the discussion of metrics below).

Within a stored procedure the isolation level is set with the
set isolation := level;
statement, where level is one of 'serializable', 'repeatable', 'committed', 'uncommitted'. Example:
set isolation = 'serializable';
The standard SQL syntax is also supported :
SET TRANSACTION ISOLATION LEVEL <isolation_level>

<isolation_level> : READ UNCOMMITTED | READ COMMITTED | REPEATABLE READ | SERIALIZABLE
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
The effect lasts for the rest of the procedure and any procedure called from it. The effect ends when the procedure that executed the set isolation statement returns.
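A minimal sketch (the procedure and table names are hypothetical): a report procedure that runs at read committed for its whole duration:

create procedure ORDER_COUNT_REPORT ()
{
  declare n integer;
  set isolation = 'committed';  -- in effect until this procedure returns
  n := (select count (*) from DB.DBA.ORDERS);  -- hypothetical table
  return n;
}
;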
Sample Deadlock Handler

The text for a deadlock handler is:
declare retry_count int;
retry_count := 0;
retry:
{
  declare exit handler for sqlstate '40001'
    {
      rollback work;
      ...
      delay (rnd (2.5)); -- if 2.5 seconds is the expected duration of the transaction.
      ...
      retry_count := retry_count + 1;
      if (retry_count > 5)
        signal ("xxxxx", "Too many deadlock retries in xxxxx.");
      goto retry;
    }
  -- do the operations. The actual working code here.
  commit work;
}
An exclusive read is done with
select s_quantity from stock where s_i_id = 111 for update;
For the Virtuoso ODBC driver the isolation is set with the ODBC 2.x call:

rc = SQLSetConnectOption (hdbc, SQL_TXN_ISOLATION, SQL_TXN_READ_COMMITTED);

or with the equivalent ODBC 3.x call:

rc = SQLSetConnectAttr (hdbc, SQL_ATTR_TXN_ISOLATION, (SQLPOINTER) SQL_TXN_READ_COMMITTED, 0);
The constants are: SQL_TXN_READ_UNCOMMITTED, SQL_TXN_READ_COMMITTED, SQL_TXN_REPEATABLE_READ and SQL_TXN_SERIALIZABLE.
The isolation level can also be given in the connect string:

SQLDriverConnect (hdbc, hwnd, "DSN=MyDSN;IsolationLevel=Repeatable Read;UID=dba;PWD=dbapwd", SQL_NTS, NULL, 0, NULL, SQL_DRIVER_NOPROMPT);
In the Virtuoso JDBC driver the isolation is set by the java.sql.Connection.setTransactionIsolation() JDBC API.
conn.setTransactionIsolation (java.sql.Connection.TRANSACTION_SERIALIZABLE);
In the VirtuosoClient.NET provider the isolation is set by the System.Data.IDbConnection.BeginTransaction (IsolationLevel) method.
System.Data.IDbTransaction trx = conn.BeginTransaction (System.Data.IsolationLevel.ReadCommitted);
The constants are described in the System.Data.IsolationLevel enumeration documentation.
Metrics and Diagnostics

Metrics are presented at the server and the table level.
The first metric to check is the output of status ('');
The paragraph titled transaction status summarizes locking activity, including counts of lock waits and deadlocks since startup.
The system view db.dba.sys_l_stat is used for locating bottlenecks.
The columns are:

* Table - the table (index) in question
* Locks - the number of locks set on the index
* Waits - the number of times a cursor waited for a lock
* Wait_msecs - the cumulative time spent waiting for locks, in milliseconds
* Deadlocks - the number of deadlocks detected on the index
* Lock_esc - the number of lock escalations
All counts and times are cumulative from server startup.
The interpretation is as follows:
If deadlocks is high in relation to waits or locks, i.e. over 5%, there are many deadlocks and the transaction profiles may have to be adjusted. The table where deadlocks is incremented is the table where the deadlock was detected, but the deadlock may involve any number of tables. So, if A and B are locked in the order A, B half of the time and B, A the rest of the time, then the deadlock counts of the tables of A and B will be about the same, half of the deadlocks being detected when locking A and the other half when locking B.
If waits is high in relation to locks, for example over 20%, then there is probably needless contention: items are kept locked needlessly. Use read committed, make transactions shorter, or lock the items with less contention first.
Because transaction duration varies, the place with the highest count of waits is not necessarily the place with the heaviest contention if the waits are short. Use wait_msecs in addition to waits for determining where the waiting takes place.
To get a general picture, use the Conductor's Statistics page or simply do
select top 5 * from sys_l_stat order by wait_msecs desc;
to get a quick view of where time is spent. You can also sort by waits desc or locks desc.
SQL Issues

It is possible to get bad locking behavior if the SQL compiler decides to make linear scans of tables or indices and the isolation is greater than read committed. A linear scan on an index with locking is often indicated by a large number of lock escalations. If lock_esc is near locks then a large part of the activity is likely sequential reads.
The general remedy is to do any long report type transactions as read committed unless there are necessary reasons to do otherwise.
To see how a specific query is compiled, one can use the explain () function. To change how a query is compiled, one can use the table option or option SQL clauses.
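As a sketch (the ORDERS table and the ORDERS_DATE index are hypothetical names), one might first inspect and then steer the compilation:

-- show how the query is compiled
explain ('select count (*) from DB.DBA.ORDERS');

-- force the scan to use a specific index
select count (*) from DB.DBA.ORDERS table option (index ORDERS_DATE);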
Observation of Dynamics

Deadlocks and contention do not occur uniformly across time. The occurrences will sharply increase after a certain application dependent load threshold is exceeded.
Also, deadlocks will occur in batches. Several transactions will first wait for each other and then retry maybe several times, maybe only one succeeding at every round. In such worst cases, there will be many more deadlocks than successfully completed transactions. Optimize locking order and make the transactions smaller.
Looking at how the counts change, especially whether they change in bursts, is useful.
Tracing and Debugging

The most detailed picture of a system's behavior, including deadlocks and other exceptions, can be obtained with profiling. If the application is in C, Java or some other compiled language, one can use the language's test coverage facility to see execution counts for various branches of the code.
For client applications, using the Virtuoso trace () function can be useful for seeing which statements signal errors. Also the profiling report produced by prof_enable () can give useful hints on execution times and error frequencies. See Profiling and prof_enable ().
For PL applications, Virtuoso provides a profiler and test coverage facility. This is activated by setting PLDebug = 2 in the Parameters section of the ini file and starting the server. The functions cov_store () and cov_report () are used for obtaining a report of the code execution profile thus far. See the documentation on "Branch Coverage" for this. The execution counts of the lines for exception handler code will show the frequency of exceptions. If in a linear block of code a specific line has a lower execution count than the line above it, then the line with the higher count has signalled as many exceptions as the difference of the two counts.
The times indicated in the flat and call graph profile reports for PL procedures are totals across all threads and include time spent waiting for locks. Thus a procedure that consumes almost no CPU can appear high on the list if it waits for locks, especially if this is so on multiple threads concurrently. The times are real times measured on each thread separately and can thus exceed the elapsed real time.
Single stepping is not generally useful for debugging locking since locking issues are timing sensitive.