Using Analytics To Handle Breaks

Add a comment

When trying to solve last T-SQL Challenge I did find a query which fits well in both SQL Server and Oracle (with some syntax tweaking) and seems quite effective to me even on a ten millions rows table.

I said I'll try to use the more advanced analytics function of Oracle to solve the problem, and it was a bit harder that I first thought.
With the help of Tomas "MyMaster" Kytes I could find ideas and piece of code which are a more generic treatment of the issue, so it deserves it's own post.

Remember the sample and the desired output :

ID          SendState AckState
----------- --------- --------
1           0         0
2           0         0
3           1         1
4           1         1
5           1         1
6           1         0
7           1         0
8           1         0
9           1         0
10          1         1

FirstId LastId SendState AckState
------- ------ --------- ---------
1       2      0         0
3       5      1         1
6       9      1         0
10      10     1         1

The solution I used on last post was to find rows where a break happens and from there find the other Id with a self outer join.
The solution I'm about to use is to assign a number to every break, to be able to do an easy aggregation on this.

Even is this solution use SQL:2008 functions and therefore is plain ANSI SQL, it won't work on SQL Server 2008 (probably in a further release) and probably on most RDBMS.
One can use an outer join to achieve the lag function, and a scalar max to achieve the cumulative max, however performances won't be good even on a 1000 rows tables.

The two things I really like in this solution, behind the use of analytics, are :

  • only one table scan
  • only one line of code for specifying the breaking rule



I will use CTE to split the things I'm doing, but as there is no subquery factoring it can be achevied in a classic "select from (select from (select from))".

WITH B AS
(
SELECT
    T.id,
    T.sendstate,
    T.ackstate,
    case
      when T.id - lag(T.id, 1, T.id) over(partition BY T.sendstate, T.ackstate ORDER BY T.id ASC) <> 1
-- This is the rule, here i'm saying i want last row for the same (sendstate, ackstate) to be same id minus one.
      then row_number() over(ORDER BY T.id ASC) -- match with T.id in this example
-- Give a unique id on the break, null if not
    end AS brk
FROM tc9_ora T
)
SELECT * FROM B;

1    0    0    1
2    0    0   
3    1    1    3
4    1    1   
5    1    1   
6    1    0    6
7    1    0   
8    1    0   
9    1    0   
10   1    1    10

After that we have to make the break_id slide against following nulls.
Again, analytic is perfect for this :

WITH    G AS
(
    SELECT
        B.id,
        B.sendstate,
        B.ackstate,
--        last_value(B.brk ignore nulls) over(order by B.id asc) as grp -- also works
        max(B.brk) over(ORDER BY B.id ASC) AS grp
    FROM B
)
SELECT * FROM G;

1    0    0    1
2    0    0    1
3    1    1    3
4    1    1    3
5    1    1    3
6    1    0    6
7    1    0    6
8    1    0    6
9    1    0    6
10   1    1    10

And from there it's a basic aggregate on grp, sendstate, ackstate :

SELECT
    min(G.id) AS first,
    max(G.id) AS last,
    G.sendstate,
    G.ackstate
FROM G
GROUP BY
    G.grp,
    G.sendstate,
    G.ackstate;

1       2      0         0
3       5      1         1
6       9      1         0
10      10     1         1

For an easy copy / paste, here is the full query :

WITH B AS
( -- Assigning break ID on first break
SELECT
    T.id,
    T.sendstate,
    T.ackstate,
    case
      when T.id - lag(T.id, 1, T.id) over(partition BY T.sendstate, T.ackstate ORDER BY T.id ASC) <> 1
      then row_number() over(ORDER BY T.id ASC)
    end AS brk
FROM tc9_ora T
),
    G AS
( -- Assigning that break ID to the whole break
    SELECT
        B.id,
        B.sendstate,
        B.ackstate,
        last_value(B.brk IGNORE nulls) over(ORDER BY B.id ASC) AS grp
    FROM B
)
SELECT -- Basic aggregation
    min(G.id) AS first,
    max(G.id) AS last,
    G.sendstate,
    G.ackstate
FROM G
GROUP BY
    G.grp,
    G.sendstate,
    G.ackstate
ORDER BY 1 ASC

Note that this query is about 25% slower that the past solution on big tables, probably due to the window sort of analytics have to be achieved on disk instead of memory after a certain table size.

However, it's way more adjustable to more problematics.
Be sure to check the related Tom Kytes article on Oracle Magazine (last section).

Share and Enjoy:
  • Digg
  • del.icio.us
  • Google Bookmarks
  • Facebook
  • TwitThis
  • blogmarks
  • email
  • Furl
  • LinkedIn

This entry is filed under Datawarehousing, SQLing. And tagged with , , , . You can follow any responses to this entry through RSS 2.0. You can leave a response, or trackback from your own site.

  1. No Comments