Sybase IPG Group White Paper

Series : Sybase IQ 12 Technical Series

Index Selection Criteria

Sybase IQ 12.5

Index Selection Criteria

Richard Soundy
Sybase IPG (Analytics) Group for EMEA
November 2003

Abstract

This document details the basic steps for the determination of the required indexes for a Sybase IQ 12.5 database. It is not considered that this document is in any way a complete blueprint for the required indexes, it is intended more to be a layout or a basis for the layout.

In addition this document only seeks to provide the information for the 12.5 version of the IQ product. It is very likely that future releases will require a different choice of indexes, or maybe provide a different set of index to select from.

Initial Information Requirements

In order to determine the “best” initial fit of indexes for a specific database and application the following information will need to be derived. It is understood that for a new system some if not much of this information will not be available, and in this event the best information should be obtained.

In all cases the monitoring of the performance of the server/database/application will be required to correct any errors in the initial index determination. The tools and techniques used for this monitoring are outside the scope of this document, and are a subject for future documentation.

Table Related Information Required.

For each and every table the following information needs to be obtained:

1. Number of Rows in the table using the following bands:

i. Less than 2,000 rows

ii. Between 2,000 and 1,000,000 rows

iii. Greater than 1,000,000 rows.

The first two bands above (less than 1,000,000 rows) relates, in the majority of data warehouse designs to dimension or look-up tables, the final band to the fact or event tables.

The reasons for the split are two fold. An IQ system will only access a table of less than 2,000 rows with Default or Fast Projection Indexes^{^[1]}. Any enumerated indexes are only of use to the Optimizer, and are not used by the Run-Time Engine at all. This is not to say that the enumerated indexes are of no use, in fact to the contrary, they provide priceless information to the optimizer in planning the query. Also given the small size of the table(s) the indexes are also of similar size.

The banding of 2,000 to 1,000,000 rows is given by the relative sizes of the Low Fast (LF) and High Group (HG) index types. It is considered that these two index types are merely reflections of each other. This is very wrong. They are internally very different, have very different performance capabilities, and have differing load and storage costs. This document is not the place to go in to the structures of the HG or LF indexes, however the following two examples show the differing sizes of the two index types, which leads to the reasons behind the indexing guidelines presented further on in this document.

If we consider a table of 10,000 rows, which has a column that has a cardinality of 1,000[2], we should consider the differing storage sizes for the two index types under discussion (LF and HG).

An LF index will have 1 page of storage for the Index ID page and 1 bitmap for each and every discrete value the column is storing. For a 10,000-row table, given a page size of 218 Kbytes[3], the bitmaps will be 1 page in size. Hence the size of the LF index will be 1,001 pages. It should be noted that this will not use 64 times 1,001 Kbytes of disk space as the compression routines will shrink this to a smaller number. However to access a given value the system will have to read 2 pages (ID + 1 bitmap).

For the HG index the calculation is somewhat different. The HG will still have and ID page (which will also contain the B-Tree component of the index), but the G-Array component will be stored in some 2- 3 pages. This is because a G-Array page can hold some 8,000 row-id values; hence 10,000 row-ids (rows) can be stored in 2 to 3 pages. So the overall size of the HG index will be 4 pages. Also the access to a given value will still only be 2 I/Os (pages).

Given that the optimizer will select (all other thing being equal) the fast search index (LF or HG) that is physically smaller the optimizer will always select the HG index. Hence for a table of 10,000 rows an HG is far more efficient than an LF. There is a question of skewed data - but even this does not effect the overall size of the HG index to any great degree, as this will only add a couple of pages to the overall size of the index. The difference in the load time between an LF and an HG index is for all intents and purposes irrelevant for a table of only 10,000 rows.

The break even point on the relative size of the two index types is easily calculated for each table, but the larger the table, the more the likelihood is that it will be dynamic, and hence favour the use of the LF index for low cardinality columns.

We can continue to provide sizing costs for the two indexes, but the logic is clear, for smaller tables the Hg index should be considered to be the “index of choice”.

So where is the break point, well there is no firm “number-of-rows” at which the LF becomes a “better” index than the HG. But we would consider somewhere in the few million rows this takes place. Of course other questions need be answered, for example should the table have a very low volatility[4] of the size compression is not an issue then you might wish to use the HG for low cardinality columns up to 10 or 20 million row tables.

Column Related Information Required

For each and every column the following information should be obtained:

1. The cardinality of the column

2. The data type of the column

3. The expected use of the column

i. Simple searches (equality/inequality, EXISTS, IN)

ii. Range Searches (<, >, between)

iii. Aggregation functions (SUM(), AVG(), DISTINCT, COUNT)

iv. Intra-table comparison

v. Free Text searches (CONTAINS, LIKE)

vi. Ordering and grouping (ORDER BY, GROUP BY)

This information is required to determine the LF/HG and other index requirements. This is also deemed to be the most difficult information to gather for a new system. If the Sybase IQ database is to accelerate of replace an existing system, there should be no barrier to any of the above information.

As the volume of this information is likely to be high it is suggested that some form of schema design tool is adopted to hold the information. A product such as Sybase Power Designer is recommended for this area as it provides many other features in addition to its documentation functions.

Database Index Selection

The following list needs to be applied to each and every column in each and every table (with the possible exception of tables of less than 2,000 rows) in the database.

For each column the cardinality needs to be applied to the CREATE TABLE IQ UNIQUE constraint. This will provide the best version of the FP index (1-byte FP, 2-byte FP or Flat FP). Note that the MINIMIZE_STORAGE option will auto-generate an IQ UNIQUE value of 255 for all columns for which IQ UNIQUE is not specified.

Note: for a column that has a data type of FLOAT, REAL or DOUBLE then the only indexes that can be applied are flat FP. For a column of data type BIT there is a custom index structure automatically generated under it, hence there is no requirement for any further indexes to be placed upon it. A column of data type VARCHAR() with a length greater than 255 can only have a Flat FP and (if required) a Word (WRD) index placed on it.

If a column is a PRIMARY KEY or UNIQUE then this should be specified within the CREATE TABLE statement, this will auto generate a unique HG index on the column, allowing improved storage, better join processing and more statistics available to the optimizer.
If a column is a FOREIGN KEY then this should also be defined in the CREATE TABLE statement, also allowing automatic construction of an HG index, this is vital for both the optimizer and for join performance.
If a table has a multi-column PRIMARY or FOREIGN KEY, this can be specified, and the system will auto-generate the correct type of Multi-Column HG index on the columns (at this point it should be noted that the individual columns within the key will not have an HG index placed upon them – so it is the responsibility of the designer to ensure that if these columns are used for searches or whatever then the appropriate indexes are placed upon them).
For any and every other join column (column that is used in a join – but not in a PRIMARY KEY/FOREIGN KEY join), then this column should also have an HG index placed upon it.

At this point all of the above information applies to columns of any cardinality and to tables of any size. The following steps may require the column cardinality and the table size to be considered.

For a column that is going to be used for searching (Equality, inequality) for certain aggregations (COUNT, DISTINCT), for grouping and ordering, then either an HG or an LF index is required. For a table of less than around 1 million rows (see discussion above) a HG index is more efficient for all cardinalities. For a table greater in size than this use an LF for columns with a cardinality of less than around 1,500 and an HG of columns that have a cardinality greater than this.
For a column that has a data type of DATE, TIME or DATETIME, then it is suggested that the DATE, TIME and DTTM indexes are applied. This will provide for both range searches and date function searches (DATEPART() etc.).
For a non-date/time column that is involved in range searches and/or SUM() and AVG() aggregation functions it is suggested that an HNG index is applied. This may not be required if the column has a high percentage of NULL values (greater than 10% of the rows). Also the use of HNG columns should be monitored, as HG and LF can provide (in some cases) a faster range search capability.
Any column that is compared with another column in the same table should have a compare (CMP) index applied to the comparison pair of columns.
Any VARCHAR() or CHAR() columns that are likely to be used for word searches should have an word (WRD) index applied to them.

Conclusion

That is really all that is required for the implementation of indexes to an IQ database. Naturally there are always exceptions (for both table sizes cardinality [HG vs. LF]) and the performance of the system must be monitored to determine if further indexes are required or maybe some indexes to be removed.

For further information on the structure and function of the various index types then the Sybase IQ manual set can be consulted, in addition the Sybase Professional Services Groups can be used both as implementers and trainers in some of the more esoteric functionality of the Sybase IQ index sets.

Richard Soundy

November 2003

London.

Sybase, Inc.

Worldwide Headquarters

One Sybase Drive

Dublin, CA 94568-7902 USA

Tel: +800 8 SYBASE

www.sybase.com

[1] In this document I generally refer to Default Indexes as Fast Projection or FP indexes.

[2] The column supports the storage of 1,000 different values.

[3] The default IQ Page Size for an IQ 12.5 Database

[4] A table that is low volatility does not change very often; hence the loading time is not of issue.

Sybasehelp

Sunday, June 7, 2009