Posted by: Peter Scott on: April 10, 2007
Jonathan Lewis posts a nice example (with a subsequent comment from Jeremy expanding it to work well with RAC) of an use of the analytic LEAD function.
Which reminds me of a gotcha with LAG or LEAD with sparse data. Sparse data is commonly found in business intelligence databases at low levels of aggregation. Suppose we are looking at day-on-day changes in sales amount for “blue splurge widgets”, but on some days we don’t sell any at all.
As a simple example let’s make a table of date and number of blue splurge widgets sold:
CREATE TABLE "WAREHOUSE"."PS_LAG_1" ( "SALES_DATE" DATE, "QUANTITY" NUMBER ) ;
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('01-APR-07', 'DD-MON-RR'), '2');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('02-APR-07', 'DD-MON-RR'), '3');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('04-APR-07', 'DD-MON-RR'), '5');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('05-APR-07', 'DD-MON-RR'), '5');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('06-APR-07', 'DD-MON-RR'), '3');
COMMIT;
NB we made no sales on the April 3.
Using the LAG function to calculate change in sales from the previous day we get:
select sales_date, quantity, quantity - lag(quantity) over (order by sales_date) delta
from ps_lag_1;
SALES_DATE QUANTITY DELTA
——— ——— —–
01-APR-07 2
02-APR-07 3 1
04-APR-07 5 2
05-APR-07 5 0
06-APR-07 3 -25 rows selected
But did we really sell two units more on April 4 than we did on April 3? The problem here is that lag and lead always measure offsets in terms of ROWS and we need to measure our offset in terms of RANGE. But all is not lost we can use an analytic function over a range window consisting of a single row; we just need to use the BETWEEN construction and have the same value for the window start and end. In this example I will use the SUM() analytic function.
select sales_date, quantity, quantity - SUM(quantity) over (order by sales_date range between interval '1' day preceding and interval '1' day preceding ) delta
from ps_lag_1;
SALES_DATE QUANTITY DELTA
———– ————- ————-
01-APR-07 2
02-APR-07 3 1
04-APR-07 5 <= Look, a NULL!
05-APR-07 5 0
06-APR-07 3 -25 rows selected
Now we get a NULL for the calculation that would include the missing day. Of course how we deal with the NULL is up to us, we just need a NVL (or CASE) to wrap up the correct business logic.
It seems that code was broken in above comment. It should be:
select sal_dt, quantity, nvl(quantity,0) – nvl(SUM(quantity) over (order by sal_dt range between interval ‘1′ day preceding and interval ‘1′ day preceding ),0) delta
from ps_lag_1,
(select (select min(sales_date) from ps_lag_1)+level-1 sal_dt
from dual connect by level
Hope this time it fill work:
select sal_dt, quantity, nvl(quantity,0) - nvl(SUM(quantity) over (order by sal_dt range between interval '1' day preceding and interval '1' day preceding ),0) delta
from ps_lag_1, (select (select min(sales_date) from ps_lag_1)+level-1 sal_dt
from dual connect by level <= (select max(sales_date)-min(sales_date)+2 from PS_LAG_1))
where sal_dt = sales_date(+);
but another approach (Oracle 10g) is to use a partition outer join.
Wouldn’t you still use some kind row-generator source to join with?
April 10, 2007 at 8:26 pm
Well, but still on 3rd April sales was 3 less than on second. I would rather use row generator to fill in gaps.
select sal_dt, quantity, nvl(quantity,0) - nvl(SUM(quantity) over (order by sal_dt range between interval '1' day preceding and interval '1' day preceding ),0) delta
from ps_lag_1,
(select (select min(sales_date) from ps_lag_1)+level-1 sal_dt
from dual connect by level that gives:
07/04/01 2 2
07/04/02 3 1
07/04/03 -3
07/04/04 5 5
07/04/05 5 0
07/04/06 3 -2
07/04/07 -3
Regards,
Paweł