Pete Scott’s random notes

Lag, lead and sparse data

Posted by: Peter Scott on: April 10, 2007

Jonathan Lewis posts a nice example (with a subsequent comment from Jeremy expanding it to work well with RAC) of an use of the analytic LEAD function.

Which reminds me of a gotcha with LAG or LEAD with sparse data. Sparse data is commonly found in business intelligence databases at low levels of aggregation. Suppose we are looking at day-on-day changes in sales amount for “blue splurge widgets”, but on some days we don’t sell any at all.

As a simple example let’s make a table of date and number of blue splurge widgets sold:
CREATE TABLE "WAREHOUSE"."PS_LAG_1" ( "SALES_DATE" DATE, "QUANTITY" NUMBER ) ;
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('01-APR-07', 'DD-MON-RR'), '2');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('02-APR-07', 'DD-MON-RR'), '3');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('04-APR-07', 'DD-MON-RR'), '5');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('05-APR-07', 'DD-MON-RR'), '5');
INSERT INTO "PS_LAG_1" (SALES_DATE, QUANTITY)
VALUES (TO_DATE('06-APR-07', 'DD-MON-RR'), '3');
COMMIT;

NB we made no sales on the April 3.
Using the LAG function to calculate change in sales from the previous day we get:
select sales_date, quantity, quantity - lag(quantity) over (order by sales_date) delta
from ps_lag_1;

SALES_DATE  QUANTITY   DELTA
———   ———  —–
01-APR-07   2
02-APR-07   3          1
04-APR-07   5          2
05-APR-07   5          0
06-APR-07   3          -2

5 rows selected

But did we really sell two units more on April 4 than we did on April 3? The problem here is that lag and lead always measure offsets in terms of ROWS and we need to measure our offset in terms of RANGE. But all is not lost we can use an analytic function over a range window consisting of a single row; we just need to use the BETWEEN construction and have the same value for the window start and end. In this example I will use the SUM() analytic function.
select sales_date, quantity, quantity - SUM(quantity) over (order by sales_date range between interval '1' day preceding and interval '1' day preceding ) delta
from ps_lag_1;

SALES_DATE     QUANTITY       DELTA
———–    ————- ————-
01-APR-07      2
02-APR-07      3             1
04-APR-07      5              <= Look, a NULL!
05-APR-07      5             0
06-APR-07      3             -2

5 rows selected

Now we get a NULL for the calculation that would include the missing day. Of course how we deal with the NULL is up to us, we just need a NVL (or CASE) to wrap up the correct business logic.

6 Responses to "Lag, lead and sparse data"

Well, but still on 3rd April sales was 3 less than on second. I would rather use row generator to fill in gaps.

select sal_dt, quantity, nvl(quantity,0) - nvl(SUM(quantity) over (order by sal_dt range between interval '1' day preceding and interval '1' day preceding ),0) delta
from ps_lag_1,
(select (select min(sales_date) from ps_lag_1)+level-1 sal_dt
from dual connect by level that gives:

07/04/01 2 2
07/04/02 3 1
07/04/03 -3
07/04/04 5 5
07/04/05 5 0
07/04/06 3 -2
07/04/07 -3

Regards,
Paweł

It seems that code was broken in above comment. It should be:
select sal_dt, quantity, nvl(quantity,0) – nvl(SUM(quantity) over (order by sal_dt range between interval ‘1′ day preceding and interval ‘1′ day preceding ),0) delta
from ps_lag_1,
(select (select min(sales_date) from ps_lag_1)+level-1 sal_dt
from dual connect by level

Hope this time it fill work:

select sal_dt, quantity, nvl(quantity,0) - nvl(SUM(quantity) over (order by sal_dt range between interval '1' day preceding and interval '1' day preceding ),0) delta
from ps_lag_1, (select (select min(sales_date) from ps_lag_1)+level-1 sal_dt
from dual connect by level <= (select max(sales_date)-min(sales_date)+2 from PS_LAG_1))
where sal_dt = sales_date(+);

Thanks Paweł
The original post was spurred by a notion that lag and lead are row based and not range based.

Filling in the gaps to “put in the missing zeros” is indeed probably the way most people would go for this case – row generators are one one way (as you rightly show) but another approach (Oracle 10g) is to use a partition outer join.

but another approach (Oracle 10g) is to use a partition outer join.
Wouldn’t you still use some kind row-generator source to join with?

Alex
Thanks for the correction – I should have been clearer in that I was referring to a synthetic row generator source such as those cited by Paweł on his blog; you are quite right I would have to have a list of available date for the join – often I use a calendar table; guess that’s sticking to my Star Schema roots.

Leave a Reply