Archive for September, 2009

Writting Bad Query To Achieve Better Query

Add a comment

That title is sounding odd I admit it.

In the french forums développez.net there is a regular mistake that I try to emphasize on each time I read it, it's the use of GROUP BY without any aggregate functions to perform a regular DISTINCT operation.

Here's a very simple example (if you're not working on Oracle just get rid of the from dual) :

WITH GBY AS
(
SELECT 'A' AS col1, 'B' AS col2 FROM dual union ALL
SELECT 'A'        , 'B'         FROM dual union ALL
SELECT 'A'        , 'B'         FROM dual
)
SELECT col1, col2
  FROM GBY;

COL1 COL2
A    B
A    B
A    B

(I said very simple)

With GROUP BY and DISTINCT, both query gives me the same result : Continue reading 'Writting Bad Query To Achieve Better Query'

Track Source At Column Level Inside Your Datawarehouse

Add a comment

Every one pretty much knows that code have to be commented.
In C, C++, Java, whatever your favorite language is.

In most ETL dataflow I've encountered, I've always had to reverse everything to know where the data came from.

Sometimes I can find comments in the code, but often I can't.

When working on feeding a datawarehouse (not speaking about datamarts here), the data are matched and cleansed, but except those crucial steps, most of the columns are going down straightforward.

Nevertheless, the same question remains : WHERE my data came from, WHAT is feeding them ?

So where do you comment your code ?
You can mostly comment inside :

  • documentation
  • the ETL process
  • the database

Continue reading 'Track Source At Column Level Inside Your Datawarehouse'

TSQL Challenge #13 Break The Batches, Pacmann Arithmetics Around Case

2 Comments

TSQL Challenge #13

My previous post was a very long one and it took me several hours just to write it properly, plus many hours to do the findings (which was a part of my day job, but most of the tests were done at home).

Having zero comment is a little frustrating to be honest, I hope my fellow readers just wait for part two !
I don't have yet an ETA for this, I plan it to be released in September - don't hack this site if it doesn't happen, as everyone I have a life to deal with !

That's enough for the whining part, in fact I just wanted to say I put more energy than usual in it and I didn't even write about T-SQL Challenge #13 !

Well, the rules are hard to sum up in three sentences so feel free to read the whole explanation over the official website.

For the short version, you have to create set of invoices (having multiples rows between one and ten) that can't be split between multiple sets.
A set can't possess more than ten rows but have to possess the maximum number of invoices.
The set number is reinitialized to one for every batch.

Every rows and columns from this following dataset are known except the set column which is the one to be found : Continue reading 'TSQL Challenge #13 Break The Batches, Pacmann Arithmetics Around Case'

Jaro-Winkler’s Algorithm Part One, Oracle utl_match Built-in Function

1 Comment

The Discovery

I have been working the last couple of months around some data matching, string of course, and I started this one as a full beginner on the subject.

Well, I knew the very basics : equal strings match with the equal operator (I bet you knew this one too), nearly equal strings can also match with the like operator and correct usage of '%' et '_' characters (this one also).

I heard about soundex but I didn't knew much.
Indeed it's a decent algorithm but it has his limitations too, mostly about the language and the fact that if you want to match two strings, it acts like a boolean operator.
Then a working fellow enlightened me speaking of distances algorithms : we did some googling about them.

The first one we learned about was the Levenshtein distance.
The algorithm is not so hard to understand and seems very effective, but above of this, there's a plethora of distance algorithms !

Working with Oracle 11gR1, ultimately we found that Oracle 10g introduces two match algorithms : edit-distance and jaro-winkler, found into the utl_match package.
With a built-in function, I thought I'll be the new match-king in the firm I'm working for Continue reading 'Jaro-Winkler’s Algorithm Part One, Oracle utl_match Built-in Function'