At regular intervals, be they monthly, weekly, daily or even every two minutes, additional fact information arrives into a data warehouse. After validation in the staging layer this information needs to be combined with the existing historical fact data. Similarly, new and changed reference records need to be assimilated with existing data. Reference data storage is not normally a great problem within a data warehouse; however, there are more significant challenges to be met with fact data.
Store it or lose it is a self-evident truth: as soon as some form of aggregation is applied to fact data before it is committed to historical storage, that detail is lost. If your transaction feed identifies transactions by the minute and your base aggregation is by the day, then it will be impossible to answer queries that require a time resolution finer than whole days. This need to store un-aggregated fact (or fact aggregated to the least degree compatible with business requirements) is one of the key design points of a data warehouse. Fact tables can be massive, far larger than most encountered in OLTP work. But they have several characteristics that enable us to develop strategies to manage them effectively within the data warehouse.
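To make the point concrete, here is a sketch using hypothetical table and column names (`sales_fact`, `transaction_time`, `sale_amount`): once only the daily aggregate is stored, the minute-level question can no longer be answered.

```sql
-- Base fact kept at transaction-minute grain: a minute-resolution
-- query is straightforward (hypothetical schema)
SELECT TRUNC(transaction_time, 'MI') AS txn_minute,
       SUM(sale_amount)              AS sale_amount
FROM   sales_fact
GROUP  BY TRUNC(transaction_time, 'MI');

-- But if we had aggregated to the day before committing to history...
CREATE TABLE sales_day_agg AS
SELECT TRUNC(transaction_time) AS txn_day,
       SUM(sale_amount)        AS sale_amount
FROM   sales_fact
GROUP  BY TRUNC(transaction_time);

-- ...then sales_day_agg can never answer "how much was sold
-- between 09:00 and 09:30?" -- the minute detail is gone for good.
```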
Most DWs are date based; that is, most queries look at fact measures for a given date or set of dates. They are also loaded in time-based chunks; true, data can arrive late, but in most cases each batch loads at most a few transactional days of data, and older data remains unchanged. This leads us to think that reloading the whole of our stored history at each load is probably a bad idea. One feature of Oracle is the ability to partition tables by range, and for us, range on date is a great strategy in terms of data management:
- We can readily drop partitions of data that are ‘too old’ to be of further interest to DW users
- We can confine our data load manipulations to a subset (maybe just one) of the available partitions in the data warehouse
- We can apply various techniques to the ‘non-volatile’ partitions to improve their manageability. These include compression of data to minimise storage requirements (and quite possibly boost query performance by reducing the IO overhead – we get more rows per read!) and setting tablespaces to read only, which may have a beneficial impact on backup strategy
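The techniques above can be sketched in Oracle DDL. This is an illustration only, with hypothetical names (`sales_fact`, the `p_…` partitions, the `dw_2006_q1` tablespace); a real design would also consider local indexes, `ENABLE ROW MOVEMENT` and exchange-partition loading.

```sql
-- A date-range partitioned fact table, one partition per month
CREATE TABLE sales_fact (
  transaction_date  DATE          NOT NULL,
  product_key       NUMBER        NOT NULL,
  store_key         NUMBER        NOT NULL,
  quantity          NUMBER,
  sale_amount       NUMBER(12,2)
)
PARTITION BY RANGE (transaction_date) (
  PARTITION p_2006_01 VALUES LESS THAN (TO_DATE('2006-02-01','YYYY-MM-DD')),
  PARTITION p_2006_02 VALUES LESS THAN (TO_DATE('2006-03-01','YYYY-MM-DD')),
  PARTITION p_2006_03 VALUES LESS THAN (TO_DATE('2006-04-01','YYYY-MM-DD'))
);

-- Age out data that is 'too old': a near-instant dictionary operation,
-- not a bulk DELETE over millions of rows
ALTER TABLE sales_fact DROP PARTITION p_2006_01;

-- Compress a partition once it has become non-volatile:
-- less storage, and more rows per read
ALTER TABLE sales_fact MOVE PARTITION p_2006_02 COMPRESS;

-- Set the tablespace holding settled partitions to read only,
-- so it need not be backed up on every cycle
ALTER TABLESPACE dw_2006_q1 READ ONLY;
```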
Some data warehouses do not drop old data; they just continue to grow. This of course leads us to possibilities around data lifecycle management and the use of slower storage for less-used data. I am not convinced by this approach, as often we have large amounts of unused storage from our attempts to maximise IO by having a large number of disks. I feel that seldom-accessed data could just as well reside in unused areas of the main DW storage as it could in separate slow storage.
Next I’ll look at partitioning and compression in slightly more detail.