Using Analytics To Handle Breaks
When trying to solve last T-SQL Challenge I did find a query which fits well in both SQL Server and Oracle (with some syntax tweaking) and seems quite effective to me even on a ten millions rows table.
I said I'll try to use the more advanced analytics function of Oracle to solve the problem, and it was a bit harder that I first thought.
With the help of Tomas "MyMaster" Kytes I could find ideas and piece of code which are a more generic treatment of the issue, so it deserves it's own post.
Remember the sample and the desired output :
----------- --------- --------
1 0 0
2 0 0
3 1 1
4 1 1
5 1 1
6 1 0
7 1 0
8 1 0
9 1 0
10 1 1
FirstId LastId SendState AckState
------- ------ --------- ---------
1 2 0 0
3 5 1 1
6 9 1 0
10 10 1 1
The solution I used on last post was to find rows where a break happens and from there find the other Id with a self outer join.
The solution I'm about to use is to assign a number to every break, to be able to do an easy aggregation on this.
Even is this solution use SQL:2008 functions and therefore is plain ANSI SQL, it won't work on SQL Server 2008 (probably in a further release) and probably on most RDBMS.
One can use an outer join to achieve the lag function, and a scalar max to achieve the cumulative max, however performances won't be good even on a 1000 rows tables.
The two things I really like in this solution, behind the use of analytics, are :
- only one table scan
- only one line of code for specifying the breaking rule
I will use CTE to split the things I'm doing, but as there is no subquery factoring it can be achevied in a classic "select from (select from (select from))".
(
SELECT
T.id,
T.sendstate,
T.ackstate,
case
when T.id - lag(T.id, 1, T.id) over(partition BY T.sendstate, T.ackstate ORDER BY T.id ASC) <> 1
-- This is the rule, here i'm saying i want last row for the same (sendstate, ackstate) to be same id minus one.
then row_number() over(ORDER BY T.id ASC) -- match with T.id in this example
-- Give a unique id on the break, null if not
end AS brk
FROM tc9_ora T
)
SELECT * FROM B;
1 0 0 1
2 0 0
3 1 1 3
4 1 1
5 1 1
6 1 0 6
7 1 0
8 1 0
9 1 0
10 1 1 10
After that we have to make the break_id slide against following nulls.
Again, analytic is perfect for this :
(
SELECT
B.id,
B.sendstate,
B.ackstate,
-- last_value(B.brk ignore nulls) over(order by B.id asc) as grp -- also works
max(B.brk) over(ORDER BY B.id ASC) AS grp
FROM B
)
SELECT * FROM G;
1 0 0 1
2 0 0 1
3 1 1 3
4 1 1 3
5 1 1 3
6 1 0 6
7 1 0 6
8 1 0 6
9 1 0 6
10 1 1 10
And from there it's a basic aggregate on grp, sendstate, ackstate :
min(G.id) AS first,
max(G.id) AS last,
G.sendstate,
G.ackstate
FROM G
GROUP BY
G.grp,
G.sendstate,
G.ackstate;
1 2 0 0
3 5 1 1
6 9 1 0
10 10 1 1
For an easy copy / paste, here is the full query :
( -- Assigning break ID on first break
SELECT
T.id,
T.sendstate,
T.ackstate,
case
when T.id - lag(T.id, 1, T.id) over(partition BY T.sendstate, T.ackstate ORDER BY T.id ASC) <> 1
then row_number() over(ORDER BY T.id ASC)
end AS brk
FROM tc9_ora T
),
G AS
( -- Assigning that break ID to the whole break
SELECT
B.id,
B.sendstate,
B.ackstate,
last_value(B.brk IGNORE nulls) over(ORDER BY B.id ASC) AS grp
FROM B
)
SELECT -- Basic aggregation
min(G.id) AS first,
max(G.id) AS last,
G.sendstate,
G.ackstate
FROM G
GROUP BY
G.grp,
G.sendstate,
G.ackstate
ORDER BY 1 ASC
Note that this query is about 25% slower that the past solution on big tables, probably due to the window sort of analytics have to be achieved on disk instead of memory after a certain table size.
However, it's way more adjustable to more problematics.
Be sure to check the related Tom Kytes article on Oracle Magazine (last section).
This entry is filed under Datawarehousing, SQLing. And tagged with analytic functions, Challenges, Oracle, Tom Kytes. You can follow any responses to this entry through RSS 2.0. You can leave a response, or trackback from your own site.