Posted by: Peter Scott | April 14, 2007

Constraints

Constraints do just that; when enforced they constrain data so that rules are obeyed. But even when not enforced, the existence of an Oracle RELY constraint does something really useful: it tells the CBO (the cost-based optimizer), and especially the query rewrite engine, that certain facts about the data can be presumed to be true. Of course, if we don’t tell the truth to the database we can run into all sorts of problems.
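
As a minimal sketch (the fact table and column names below are hypothetical, not from any real system), declaring such a constraint and letting query rewrite trust it looks like this:

    -- Hypothetical fact table and key columns, for illustration only.
    -- RELY DISABLE NOVALIDATE: no index is built and no rows are checked,
    -- but the optimizer and query rewrite may presume the key holds.
    ALTER TABLE fact_sales
      ADD CONSTRAINT fact_sales_uk
      UNIQUE (day_key, store_key, product_key)
      RELY DISABLE NOVALIDATE;

    -- Query rewrite only uses unenforced constraints when integrity is relaxed:
    ALTER SESSION SET QUERY_REWRITE_INTEGRITY = TRUSTED;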

But why should we want to tell the database to trust us? Wouldn’t it be better to let the database examine the data and use real, enforced constraints? Real unique constraints are enforced by b-tree indexes, and for large fact tables this can be a problem; typically we would need an index based on each dimension column in the fact table, and furthermore, for a partitioned table this should be a global rather than a local index. In one of my systems we have a 600 million row table spread over 800 partitions, and each row is uniquely defined by a combination of 10 dimensional attributes. Having an index of that size is no joke: we need masses of storage, and we either slow the data load as each item is checked against the constraint or we spend time rebuilding the constraint after the load. And to what purpose? It is not as if we would often run queries that target specific rows by way of the composite index.
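
For contrast, actually enforcing that key would look something like the sketch below (hypothetical names again, with only three of the ten key columns shown). It is the supporting unique index, global on a partitioned fact table, that costs the storage and load time just described:

    -- Enforced version: Oracle builds and maintains a unique b-tree index
    -- behind the constraint and checks every row as it is loaded.
    ALTER TABLE fact_sales
      ADD CONSTRAINT fact_sales_uk
      UNIQUE (day_key, store_key, product_key)
      ENABLE VALIDATE;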

For my money, real constraints are used in the staging process to validate data; when the data is proven clean and published to the main data warehouse, it is time for the RELY constraints to take over.

Responses

  1. Interesting feature. It is probably most welcome among the Java crowd that prefers to manage data integrity in the application (instead of the DB).

  2. I’ll happily go along with this in a read-only warehousing environment, which is what it appears you’re referring to. When we get to transactional apps my paranoia begins to kick in; I don’t trust the middle- and client-tier programs to consistently enforce the rules. I get particularly nervous when I think about an evolving application in which maintenance programmers ‘adjust’ the code 5 years down the road.

  3. Beth, yes I am talking about data warehouses – should have linked back to the original post to make that clearer.

    Rob – as Beth said, can you really trust a developer to manage programmatic constraints… oops, I used to be a developer, so perhaps I shouldn’t have said that ;-)

  4. I wasn’t making a statement about whether it was desirable, just that it would be welcomed by the crowd that does trust the programmer. (For what it’s worth, no, I’m not in that crowd).

  5. Saw an interesting point in an analytics session here in Vegas. If your data tends to come in with lots of constraint violations like duplicate keys, it can be faster to scrub them in staging tables with analytics than to go through the “rejects table” routines (see the first sketch after the comments).

  6. From comments I received, both here and in private email, I must emphasise that RELY is a way of telling the database “trust me, I am telling you that this is how it is.” If we lie to the database, then don’t be surprised if features such as materialized view query rewrite select inappropriate summary tables.

    Rob Vollman’s comment about the desirability of its use in a transactional environment is very true (but there again, he is a pragmatic guy) – just because you can do something does not mean it is the right thing to do.

    Joel – wish I had heard that talk. I am still a bit “old school” on reference loads: I disable constraints, load the data and then re-enable the constraints with an EXCEPTIONS INTO clause. But for fact loads I use outer joins and look for nulls (for foreign keys) or not nulls (for duplicates) – the analytics could be a good way to go here. Both checks are sketched after the comments.

  7. Does this mean you still maintain the b-tree unique index in your staging area for the 600 million row table spread over 800 partitions, uniquely defined by a combination of 10 dimensional attributes? As an IOT? And you just switch to a RELY constraint on the main DW?

  8. DJ – for staging I use a lightweight check, not the whole history of the database. We have a business rule where we only load data that is no more than six weeks old – this immediately reduces the size of the data set to check against by almost 95%. We tend to represent this as an IOT copy of the key columns (sketched after the comments). Once we prove the data to be clean we publish to the main DW tables (raw data & summary tiers) and use RELY constraints from then on.

  9. Thanks for the reply. It’s nice to know what the industry experts are doing. Keep up the good work.

  10. [...] Pete Scott: Short note on a DWH scenario [...]

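The sketches referenced in the comments above follow; every table and column name in them is hypothetical. First, Joel’s analytics approach: assuming a staging table stg_fact with key columns day_key, store_key and product_key and a load timestamp load_ts, duplicate keys can be scrubbed in place rather than routed through a rejects table:

    -- Keep the most recent row per key and delete the rest.
    DELETE FROM stg_fact
     WHERE rowid IN (
           SELECT rid
             FROM (SELECT rowid AS rid,
                          ROW_NUMBER() OVER (PARTITION BY day_key, store_key, product_key
                                             ORDER BY load_ts DESC) AS rn
                     FROM stg_fact)
            WHERE rn > 1);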
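Next, the two “old school” checks from comment 6, again with assumed names. The EXCEPTIONS table used below is the one created by Oracle’s utlexcpt.sql script:

    -- Reference (dimension) load: disable, bulk load, then re-validate;
    -- if validation fails, the rowids of offending rows are recorded.
    ALTER TABLE dim_product DISABLE CONSTRAINT dim_product_pk;
    -- ... bulk load dim_product here ...
    ALTER TABLE dim_product ENABLE VALIDATE CONSTRAINT dim_product_pk
      EXCEPTIONS INTO exceptions;

    -- Fact load check 1: outer join to the dimension; a null on the
    -- dimension side means an orphaned foreign key in staging.
    SELECT s.*
      FROM stg_fact s
      LEFT OUTER JOIN dim_product d ON d.product_key = s.product_key
     WHERE d.product_key IS NULL;

    -- Fact load check 2: outer join to the published fact table; a
    -- not-null match means the incoming key is a duplicate.
    SELECT s.*
      FROM stg_fact s
      LEFT OUTER JOIN fact_sales f
        ON  f.day_key     = s.day_key
        AND f.store_key   = s.store_key
        AND f.product_key = s.product_key
     WHERE f.day_key IS NOT NULL;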
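Finally, the lightweight key store from comment 8 might be built as an index-organized table holding only the key columns for the six weeks of data the business rule allows (names assumed once more):

    -- Index-organized table: the table is stored as its primary key index,
    -- so the uniqueness check against recent history stays small and fast.
    CREATE TABLE stg_recent_keys (
      day_key      NUMBER NOT NULL,
      store_key    NUMBER NOT NULL,
      product_key  NUMBER NOT NULL,
      CONSTRAINT stg_recent_keys_pk
        PRIMARY KEY (day_key, store_key, product_key)
    ) ORGANIZATION INDEX;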

