Posted by: Peter Scott | July 5, 2006

Techniques to reduce IO – partitioning and compression

Last time I wrote a few general notes about the storage of fact and reference data. This time I will look at partitioning and compression in slightly more detail.

Partitioning is a way to divide a large table into separately managed chunks, the contents of each chunk being determined by the partitioning key. Although it is possible to create a table with a single partition (and this does have uses!) most partitioned tables have multiple partitions. With Oracle there are three possible types of partition key: hash, range, and list. Other database vendors also offer partitioning (most recently Microsoft, with range partitioning in SQL Server 2005), but the partitioning from most vendors is often restricted to a single type, usually hash.

Range and list partitions are perhaps the simplest to understand: the table has a partition key column (or columns, in the case of a compound key), and rows are allocated to partitions based on the value of that key. With range partitioning a partition stores values less than its stated maximum; with list partitioning a partition stores rows whose key matches one of the values in the list for that partition. With hash partitioning the column value is first hashed to determine which partition to use; the hashing algorithm works best when the number of partitions in the table is an exact power of two. Composite partitioning is also possible, where the partitions are further divided into sub-partitions. In Oracle 8i sub-partitions were limited to hash, but in 10g we can also use list (though not range) sub-partitions.
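As a minimal sketch of range and list partitioning (table and column names here are invented for illustration, not taken from any real schema):

```sql
-- A fact table range-partitioned by month: each partition holds rows
-- with sale_date below its stated upper bound.
CREATE TABLE sales (
  sale_date DATE,
  store_id  NUMBER,
  amount    NUMBER(10,2)
)
PARTITION BY RANGE (sale_date) (
  PARTITION p_2006_05 VALUES LESS THAN (TO_DATE('2006-06-01','YYYY-MM-DD')),
  PARTITION p_2006_06 VALUES LESS THAN (TO_DATE('2006-07-01','YYYY-MM-DD'))
);

-- A list-partitioned table: rows go to the partition whose list
-- contains their key value.
CREATE TABLE stores (
  store_id NUMBER,
  country  VARCHAR2(2)
)
PARTITION BY LIST (country) (
  PARTITION p_emea VALUES ('UK','FR','DE'),
  PARTITION p_apac VALUES ('JP','AU')
);
```

With hash partitioning the clause would instead be `PARTITION BY HASH (store_id) PARTITIONS 8`, the partition count ideally being a power of two as noted above.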

Partitioning schemes first and foremost improve database manageability by dividing large tables into smaller, more manageable chunks; however, partitioning can also boost performance. Consider a query against a partitioned table: if we include the partition key in the predicate, the database knows which partition holds the required data, and the other partitions are not accessed. Similarly, if all the data to be removed from a data warehouse is held in a single partition (probably not a hash partition!) then we can simply truncate or drop that partition, which is far, far quicker than deleting from a table based on a predicate.
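Both points can be sketched against the hypothetical monthly `sales` table used above:

```sql
-- The partition key appears in the predicate, so the optimiser can
-- prune to the single June 2006 partition and skip the rest.
SELECT SUM(amount)
FROM   sales
WHERE  sale_date >= TO_DATE('2006-06-01','YYYY-MM-DD')
AND    sale_date <  TO_DATE('2006-07-01','YYYY-MM-DD');

-- Ageing out a month of history is a near-instant metadata operation,
-- not a row-by-row DELETE:
ALTER TABLE sales DROP PARTITION p_2006_05;
-- or, to keep the partition definition but empty it:
-- ALTER TABLE sales TRUNCATE PARTITION p_2006_05;
```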

Although hash partitioning can work well in OLTP systems, in my opinion it is less useful in a data warehouse. As we do not normally know which partition contains which key values, we are unlikely to be able to exploit partition manipulation operations such as partition exchange or partition truncation as part of our data load process. We are also unlikely to achieve much partition elimination for queries that involve more than one partition key value; in the real world people often query on related groups of items, such as the cities in a region, and the hashing algorithm would most likely spread that data across many partitions. A further limitation of hash partitioning is that we are unlikely to be able to set the tablespace for a partition to read-only, as we could not ensure that newly arrived data would not belong in that partition.

The concept behind partitioning has been part of Oracle for a long while; Oracle 8i introduced the first incarnation of true partitioned tables, and before that Oracle 7.3 had partition views. A partition view used multiple physical tables of the same design to hold the data, with a UNION ALL view over the set of tables forming the "partitioned" object; each table contained only the data relevant to its partition, typically enforced using check constraints. The query optimiser knew about this construction and would only access the tables that contained the data needed to answer the query. Although the database setting PARTITION_VIEWS_ENABLED=TRUE is no longer supported from Oracle 10gR1, it seems that by default the value is treated as true; that is, partition views still work. However, partition views are nowhere near as flexible as partitioned tables and should certainly not be considered for new applications.
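For the historical record, a partition view might have been built along these lines (a simplified sketch with invented names, not a recommendation for new work):

```sql
-- One physical table per "partition", each constrained to its own
-- slice of the data by a check constraint.
CREATE TABLE sales_2005 (
  sale_date DATE CHECK (sale_date <  TO_DATE('2006-01-01','YYYY-MM-DD')),
  amount    NUMBER(10,2)
);

CREATE TABLE sales_2006 (
  sale_date DATE CHECK (sale_date >= TO_DATE('2006-01-01','YYYY-MM-DD')),
  amount    NUMBER(10,2)
);

-- Queries run against the UNION ALL view; the optimiser uses the
-- check constraints to skip tables that cannot hold matching rows.
CREATE VIEW sales_pv AS
  SELECT sale_date, amount FROM sales_2005
  UNION ALL
  SELECT sale_date, amount FROM sales_2006;
```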

Table compression is another good way to reduce IO. The rationale behind it is that we are reading lots of rows from a table, so the more rows we can read in one hit the better, because the overhead of uncompressing the rows is less than the overhead of reading the uncompressed rows from disk. Table compression works with both tables and partitions (you can even mix compressed and uncompressed partitions within a single table). Unlike index compression, table compression only works on bulk insert; you can still use non-bulk techniques to insert into a compressed table, but the data inserted will not be compressed, and updating rows also causes those rows to be stored uncompressed. Compression is organised by database block and relies on the detection of repeating data patterns within the block: the repeating patterns are tokenised and stored in a look-up area within the block, and the data itself is replaced with the tokens. Enhanced compression can be obtained by ordering the data being inserted to maximise the number of tokens replaced.
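In outline (again using the invented `sales` table, and assuming the basic table compression available in this era of Oracle):

```sql
-- Compression is declared on the table (or on individual partitions):
CREATE TABLE sales_hist (
  sale_date DATE,
  amount    NUMBER(10,2)
) COMPRESS;

-- Only bulk, direct-path inserts produce compressed blocks; the
-- ORDER BY clusters repeating values into the same blocks, giving
-- the tokeniser more to work with.
INSERT /*+ APPEND */ INTO sales_hist
SELECT sale_date, amount
FROM   sales
ORDER  BY sale_date;

-- An existing uncompressed segment can be compressed by rebuilding it:
ALTER TABLE sales_hist MOVE COMPRESS;
```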


Responses

  1. I’m with you on the hash partitioning — I’ve not found it to be very useful, mostly because I’m generally going to range partition a fact table and then choose between list and hash subpartitioning. In such a case I can almost always find a column that can be usefully list subpartitioned so as to provide further options for the CBO to do a bit of pruning. Hash partitioning is more of a load balancing or join-enhancing technique and neither of those are of as much value, IMHO.

  2. Hey Pete — you made the top of the “SQL Server Topics” list in today’s SSWUG email. Must have been the one-line mention of Sql Server 2005. :-)

    Nice article, and timely for my current tasks.

  3. I like you website so much…
    Learning from you…
    could you please give some examples of how list partitioning is used on a DW project?

    thanks so much,
    liyan

  4. Sometimes you want to arrange data into partitions based on a key that does not fit into ranges; for example, countries could be partitioned by continent, but there is no simple algorithm to say that Belgium is in Europe and Borneo is in Asia. In cases like this we can use list partitions to define which known values of the partition key fit into which partition. This can give great performance for queries that select many key values from the same partition: for my example, we could look at sales in Europe and access only the Europe partition. But one downside of list partitions is that the key must be something that does not move about a lot; countries rarely change continent, but people can change town, so basing a list on a volatile attribute is not such a good idea.

