Posted by: Peter Scott | June 29, 2006

DW Design (2) – staging data

As mentioned yesterday, the staging area of the data warehouse has three functional uses:

  • It is the initial target for data loads from source systems
  • It validates the incoming data for integrity
  • It is the data source for information to be published to the 'user visible' layers of the data warehouse

Optionally, it may also be where the logic to transform incoming data is applied.

Data loading: Early Oracle data warehouses used one of two techniques to populate the staging layer: reading in flat files and loading them with, say, SQL*Loader, or querying source systems directly through database links. Although these methods are still widely used, subsequent versions of Oracle have offered more sophisticated approaches. SQL*Loader gained direct path load options, and other techniques became available. From Oracle 9 onwards it became possible to query flat files directly from the database (as external tables), and technologies developed (or enhanced) for database replication and recovery added some very useful techniques for capturing incoming data; in particular, log mining and transportable tablespaces offer some powerful features within data warehouse data load.
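
As an illustration, here is a minimal sketch of the external table approach available from Oracle 9i onwards; the directory path, file name and STG_SALES staging table are hypothetical:

    -- Directory object pointing at the area where source files land (hypothetical path)
    CREATE OR REPLACE DIRECTORY stage_dir AS '/u01/app/feeds';

    -- External table: the flat file can be queried in place, no separate SQL*Loader run
    CREATE TABLE stg_sales_ext (
      sale_id      NUMBER,
      product_code VARCHAR2(20),
      sale_date    VARCHAR2(10),
      amount       NUMBER(12,2)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY stage_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
      )
      LOCATION ('sales_feed.csv')
    )
    REJECT LIMIT UNLIMITED;

    -- Populate the staging table with a straight insert ... select
    INSERT /*+ APPEND */ INTO stg_sales (sale_id, product_code, sale_date, amount)
    SELECT sale_id, product_code, TO_DATE(sale_date, 'YYYY-MM-DD'), amount
    FROM   stg_sales_ext;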

Once loaded into staging tables it is a good idea to check the validity of the data. Minimally, we check for duplicated data and that all of the foreign keys have parents; this can be further extended to range-check attribute values. For reference data I favour checking that the whole of the loaded reference set is self-consistent; that is, all hierarchical relationships (such as a state having a country as a parent) are present. For fact data I favour validating against the already published data in the raw data layer.

Using enabled foreign keys for referential integrity checking may not be the best approach for performance. One technique I have used in the past is to load the stage table with all of its constraints disabled and then to enable the constraints with an EXCEPTIONS INTO clause; this captures the rowids of the rows that fail validation. Simplistically, we then copy the rows that fail validation into a 'rejects' table and delete them from the stage table. Remember, though, that it is quite possible for a row to be rejected for more than one reason. It is also good practice to datestamp the rejected rows and give a reason for rejection, if only to help support staff find out what is wrong with the data! Another technique to validate data is to outer join the stage table to its foreign key (parent) tables and then use a multi-table insert to separate rejected data (which has null values in the joined key columns) from the valid.
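
For illustration, a sketch of the exceptions approach, assuming a hypothetical STG_SALES stage table with a disabled foreign key to a PRODUCTS reference table (the standard EXCEPTIONS table comes from Oracle's utlexcpt.sql script):

    -- Enabling the disabled constraint validates the whole batch in one pass;
    -- if any rows fail, the ALTER itself errors but the offending rowids are
    -- written to the exceptions table
    ALTER TABLE stg_sales
      ENABLE VALIDATE CONSTRAINT fk_sales_product
      EXCEPTIONS INTO exceptions;

    -- Copy the failing rows to a rejects table, datestamped and with a reason
    INSERT INTO stg_sales_rejects (sale_id, product_code, rejected_on, reject_reason)
    SELECT s.sale_id, s.product_code, SYSDATE, 'No parent row in PRODUCTS'
    FROM   stg_sales s
    WHERE  s.rowid IN (SELECT row_id FROM exceptions WHERE table_name = 'STG_SALES');

    -- ...then remove them from the stage table before publishing
    DELETE FROM stg_sales
    WHERE  rowid IN (SELECT row_id FROM exceptions WHERE table_name = 'STG_SALES');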

Normally we only publish data that does not already exist in the data warehouse. For reference data this implies that we look for changes between the already published data and the newly staged data. This can be achieved using the SQL MINUS operator, followed by an upsert (MERGE) of the resulting set of changed and new records. If we are tracking slowly changing dimensions we would insert both the new and the changed records with some form of date stamp to mark the effective date. Most DW systems would not delete reference items when they no longer exist in the incoming feed, simply because there may be historical fact data that refers to them.
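
A sketch of that pattern, assuming hypothetical STG_PRODUCT (staged) and DIM_PRODUCT (published) tables:

    -- The MINUS isolates new and changed rows so the MERGE only touches the delta
    MERGE INTO dim_product d
    USING (
        SELECT product_code, product_name, category
        FROM   stg_product
        MINUS
        SELECT product_code, product_name, category
        FROM   dim_product
    ) s
    ON (d.product_code = s.product_code)
    WHEN MATCHED THEN
      UPDATE SET d.product_name = s.product_name,
                 d.category     = s.category
    WHEN NOT MATCHED THEN
      INSERT (product_code, product_name, category)
      VALUES (s.product_code, s.product_name, s.category);

For a slowly changing dimension we would instead insert the whole delta as new rows, each stamped with an effective date, rather than updating in place.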

In the case of fact data we would generally insert it into the historic fact tables. But there is a wealth of techniques we can use to do this, and often they depend on the design we have adopted for the raw fact layer. I'll look at this next time.


Responses

  1. Hi,
    can you give me some idea about single-stage and multi-stage data warehouses, along with a diagram (if any)? You can also send me the relevant information at my e-mail ID: back2mohit@yahoo.co.in
    Regards,
    Mohit

    • Host-based single-stage (LAN) data warehouses
      With a LAN-based warehouse, data delivery can be managed either centrally or from the workgroup environment, so that business groups can meet and manage their own information needs without burdening centralized IT resources.
      Limitations/challenges:
      LAN-based warehousing solutions are normally limited by both DBMS and hardware scalability factors.
      Many LAN-based enterprises have not implemented adequate job scheduling, recovery management, organized maintenance, and performance monitoring procedures to support robust warehousing solutions.
      Often these warehouses are dependent on other platforms for source data. Building an environment that has data integrity, recoverability, and security needs careful design, planning and implementation; otherwise, synchronisation of changes and loads from sources to the server could cause innumerable problems.

      2. Multistage data warehouses
      This configuration is well suited to environments where end users in different capacities require access both to summarized data for up-to-the-minute tactical decisions and to summarized, cumulative data for long-term strategic decisions. Both the ODS (Operational Data Store) and the data warehouse may reside on host-based or LAN-based databases, depending on volume and usage requirements. Typically the ODS stores only the most recent records, while the data warehouse stores the historical evolution of the records.

  2. Hello,
    could you please send me details of the various types of data warehouses, i.e. host-based, single-stage, LAN-based, multistage, distributed and virtual data warehouses, along with diagrams?
    You can send this to my e-mail ID: monika04verma@gmail.com

