One of my old customers had a data warehouse with a three-area design model: a staging area for data load/cleanse, a third normal form (or more 3-NF-like than dimensional) raw data layer and a denormalised dimensional layer containing aggregated data for reporting. They ran a retail business and one of the key “fact” feeds was the point of sale transactions from each outlet. This was presented as two tables, a master table of who, where and when and detail table of each item scanned at the checkout linked to master table by a transaction number. In the base aggregation layer we sum up the sales to a customer/store/day/item granularity; that is, if the a customer shops twice in a day in a single store it counts as one visit, likewise if they buy six jars of coffee on the same day in a store there is only one coffee sales record in the summary (with a quantity of 6) Furthermore the products are slightly aggregated to account for promotional packs having differing barcodes but effectively the same “master” product.
When this DW was first rolled out 7 years ago the raw data was only used by one or two specialist analysts answering non-mainstream questions, and as such it was specified by the customer that no indexing would be required on the raw data. Overtime the inevitable changes in user focus occurred and a new group of users came on board that had a requirement to measure customer profitability from Point of Sale (PoS) data. These queries were a major feature in our top queries by resource and queries by execution time analysis. One way to improve performance would be to flatten the data by denormalising header and line data but the customer was keen to keep the data looking as similar to the source as possible so we went for building compressed b-tree locally partitioned indexes on the sales header (store + customer composite) and sales line (transaction number) tables. This gave a measurable performance boost for queries that looked at customer or store for given date ranges.
But one group users were interested in the purchasers of specific promotional packs and this detail was only available in the raw PoS data. They decided to try indexing the item code in the line level table and then go back to the master records to find the who and / or where. In this case the users found no change in there query time, it was still slow. So why wasn’t the index used? Looking at the steps shown in the before and after indexing query plans the plans looked remarkably similar (OK, identical) but there was a difference if you considered the costs and rows columns, with the index the costs and cardinality looked far more realistic for the data involved. So the presence of the index gave the CBO a better handle on the nature of the data but it decided that the a query plan that did not use the index was still the best way answer the question.
Of course this not mean to say we could not improve the query performance by using different statistics or other indexes, it is just meant to show that an index can change query cost estimates without being used in the query plan.
For the interested, Oracle 10gR2 on 64 bit Solaris with a 4 TB SAN based DW somewhere in the background