Another TSQL Challenge : How To Handle Breaks !

Add a comment

Beyondrelational published their new TSQL Challenge #9, and as usual Antoine Gémis translated it in french.

They have taken redefined the goal about those challenges which I think is a very nice move.

The goal of TSQL Challenges is to help people enhance their SET based query writing skills.
Most of the times, a SET based query performs better than a row-by-row (often referred as RBAR – Row By Agonizing Row).
Most people find trouble writing set based operations because they are more familiar with procedural programming model. It takes some practice and exercises to develop the SET based thinking and query writing skills.

If you don't work on a regular basis on databases, I can say the amount of TSQL or PL/SQL poorly written that would run smoothly in straight SQL is far for negligible.

Back to the topic, this time we are asked to create breaks from a table.
From :

ID          CreationDate            Content    SendState AckState
----------- ----------------------- ---------- --------- --------
1           2009-05-27 22:15:43.647 Msg #1     0         0
2           2009-05-28 00:39:43.647 Msg #2     0         0
3           2009-05-28 03:03:43.647 Msg #3     1         1
4           2009-05-28 05:27:43.647 Msg #4     1         1
5           2009-05-28 07:51:43.647 Msg #5     1         1
6           2009-05-28 10:15:43.647 Msg #6     1         0
7           2009-05-28 12:39:43.647 Msg #7     1         0
8           2009-05-28 15:03:43.647 Msg #8     1         0
9           2009-05-28 17:27:43.647 Msg #9     1         0
10          2009-05-28 19:51:43.647 Msg #10    1         1


To :

FirstIdInclusive LastIdInclusive SendState AckState
---------------- --------------- --------- ---------------
1                2               0         0
3                5               1         1
6                9               1         0
10               10              1         1

I've solved this by finding rows where a break in the couple (SendState, AckState) appears.
From here i tried two solutions : one is to find the "breakID", and then self joining it with handling the first or last output of the query.

The other one is to find the break before, assign them to a group, then find the break after and assign them to the same group, and join those two sets.
Performance wise, it's about the same, so I'll first publish the second one here even if the query is a bit longer, I think it's easier to understand the concept behind.

I'm using two common table expressions to handle those special rows. As you're about to see, to find the breaks I'm just doing an existence predicate on the same (SendState, AckState) but with the last or next ID. If the couple is not the same, there is a break !

WITH FirstID(Grp, ID, SendState, AckState) AS
(
SELECT
    row_number() over(ORDER BY m.ID ASC),
    m.ID,
    m.SendState,
    m.AckState
FROM
    @tc9 AS m
WHERE
    NOT EXISTS (SELECT NULL FROM @tc9 AS m1
                 WHERE m1.ID = m.ID-1
                   AND m1.SendState = m.SendState
                   AND m1.AckState  = m.AckState)
),
     LastID(Grp, ID) AS
(
SELECT
    row_number() over(ORDER BY m.ID ASC),
    m.ID
FROM
    @tc9 AS m
WHERE
    NOT EXISTS (SELECT NULL FROM @tc9 AS m1
                 WHERE m1.ID = m.ID+1
                   AND m1.SendState = m.SendState
                   AND m1.AckState  = m.AckState)
)
SELECT
    f.ID AS FirstIdInclusive,
    l.ID AS LastIdInclusive ,
    f.SendState,
    f.AckState
FROM
    FirstID AS f
    INNER JOIN LastID AS l
      ON f.Grp = l.Grp;

I ran some tests on a 1 million rows table and another 10 millions rows table on my home PC with SQL Server 2008 Express.
By filtering the display part using a count(*), I'm quite happy with the times involved.

Only 3 seconds for the 1 million rows table, and around 75 seconds for the 10 million rows table.

If I display the output, it takes 8 secondes against the 1 million rows table and 130 against the 10 millions rows one, almost twice, but with five million rows to display I'm still happy with the performances.

Ok, now here is the shorter query BreakID solution :

WITH BreakID(Grp, ID, SendState, AckState) AS
(
SELECT
    row_number() over(ORDER BY m.ID ASC),
    m.ID,
    m.SendState,
    m.AckState
FROM
    @tc9 AS m
WHERE
    NOT EXISTS (SELECT NULL FROM @tc9 AS m1
                 WHERE m1.ID = m.ID+1
                   AND m1.SendState = m.SendState
                   AND m1.AckState  = m.AckState)
)
SELECT
    coalesce(f.ID+1, (SELECT min(ID) FROM @tc9)) AS FirstIdInclusive,
    l.ID AS LastIdInclusive ,
    l.SendState,
    l.AckState
FROM
    BreakID AS f
    RIGHT OUTER JOIN BreakID AS l
      ON f.Grp = l.Grp-1;

I could have used the fact that ID is an identity column with 1 as first value and replacing the select min(ID) in the coalesce function, but this way the solution has wider utility.

Using an inner join instead an outer join, one row will not show up.
Trying to union all just one more row in the CTE to achieve an inner join make the explain plan go from merge join to nested loops, and the performance became very very ugly (30 seconds to display the first 250 rows of the expected 500k rows from the 1M table, I canceled).

Later I'll update this topic to provide the Oracle's equivalent of the query.
The query should be easier to write thanks to lead and lag analytic functions.


Share and Enjoy:
  • Digg
  • del.icio.us
  • Google Bookmarks
  • Facebook
  • TwitThis
  • blogmarks
  • email
  • Furl
  • LinkedIn

This entry is filed under SQLing. And tagged with , , , . You can follow any responses to this entry through RSS 2.0. You can leave a response, or trackback from your own site.

  1. No Comments