Jaro-Winkler’s Algorithm Part One, Oracle utl_match Built-in Function
The Discovery
I have been working the last couple of months around some data matching, string of course, and I started this one as a full beginner on the subject.
Well, I knew the very basics : equal strings match with the equal operator (I bet you knew this one too), nearly equal strings can also match with the like operator and correct usage of '%' et '_' characters (this one also).
I heard about soundex but I didn't knew much.
Indeed it's a decent algorithm but it has his limitations too, mostly about the language and the fact that if you want to match two strings, it acts like a boolean operator.
Then a working fellow enlightened me speaking of distances algorithms : we did some googling about them.
The first one we learned about was the Levenshtein distance.
The algorithm is not so hard to understand and seems very effective, but above of this, there's a plethora of distance algorithms !
Working with Oracle 11gR1, ultimately we found that Oracle 10g introduces two match algorithms : edit-distance and jaro-winkler, found into the utl_match package.
With a built-in function, I thought I'll be the new match-king in the firm I'm working for :
(
SELECT 'My surname is Waldar' AS s1, 'My shortname is Waldar' AS s2 FROM dual union ALL
SELECT 'Waldar' , 'Walder' FROM dual union ALL
SELECT 'Perfect match' , 'Perfect match' FROM dual union ALL
SELECT 'Awful match' , 'Bad similitude' FROM dual
)
SELECT s1, s2, utl_match.jaro_winkler(s1, s2) AS ora_jw
FROM datas;
S1 S2 ORA_JW
-------------------- ---------------------- --------
My surname IS Waldar My shortname IS Waldar 0.9627272727272727
Waldar Walder 0.9333333333333333
Perfect match Perfect match 1.0
Awful match Bad similitude 0.5372294372294372
The results are good and also they provide a score, so you can parameter it to look for matches within a confidence interval :
FROM datas
WHERE utl_match.jaro_winkler(s1, s2)> 0.95;
S1 S2
-------------------- ----------------------
My surname IS Waldar My shortname IS Waldar
Perfect match Perfect match
But then, the troubles begun...
Mister Oracle, tell me more
Finding such high value built-in makes me always the same effect, run to Tahiti as fast as I can... or more realistically, I just click on the bookmark.
Enter "utl_match" as the searched text and then...
Summary of Documentation Search Results: utl_match
Your search term did not match any of these documentation libraries:
Oracle Database 8i, 9i Release 2, 10g Release 1, 10g Release 2, 11g Release 1, Express Edition
Oracle Fusion Middleware 11g Release 1
Oracle Application Server 9i Release 2, 10g Release 1, 10g Release 2
Oracle Collaboration Suite 9i Release 2, 10g Release 1
What ?!? Let's try Jaro-Winkler...
And it's a success !
This is Oracle Warehouse Builder User's Guide, I can't find any detailed algorithm of the function, and there's a typo in Winkler's name.
Damn it, I'm not satisfied.
Let's google it !
The best thread that sums it all is findable on OTN.
Diana Lorentz from Oracle says one year ago that she'll try to find out why this is not documented.
Being in the acknowledgments page of Laurent Schneider's book, she can be trusted !
But I have to suppose that she couldn't figure why such a hole inside Oracle's documentation.
Two more threads :
- http://forums.oracle.com/forums/thread.jspa?threadID=421326
- http://forums.oracle.com/forums/thread.jspa?threadID=385950
That's ok, I can't rely on official documentation. I'm an ETL / BI developer and that the kind of situation I have to adapt to every day :
- functional documentation that must be reversed-engineered from SQL
- technical documentation that is not up-to-date, or utterly outdated
- hidden code inside views and triggers
2009-09-04 Edit : With the release of Oracle 11g R2, there is actually some official Oracle documentation on the functions :
http://download.oracle.com/docs/cd/E11882_01/appdev.112/e10577/u_match.htm#CBHBICDE
Nothing on the algorithms used, but no one can blame Oracle for that.
Thanks to Laurent Schneider for pointing this out to me and as he said, Diana Lorentz did her job !
Jaro-Winkler Algorithm ?
First, I need to fully understand the Jaro-Winkler algorithm. I'll start with wikipedia's entry.
I know how much you can't blindly rely on it, but at least it's something to begin with and it will help me to learn the proper vocabulary to make further researches more accurate.
The Jaro-Winkler is in fact the Jaro-distance metric boosted by a Winkler-introduced formula to help strings that match from the beginning.
The string comparator value implemented by Jaro formula is :
where s1 and s2 are the strings with lengths str_len1 and str_len2, respectively.
#common are the number of matching characters within the range of greatest(str_len1, str_len2)/2-1 .
#transpositions are the number of matching characters that are not in the right order.
I found this sentence on the wiki : "The score [of the Jaro distance metric] is normalized such that 0 equates to no similarity and 1 is an exact match"
Well, after reading some papers, I disagree with this. A Jaro-distance of 0 should grant equal strings. A Jaro-similarity of 1 should grant equal strings.
But I find this normalization quite odd.
Also, in the link I found the original C code of the string comparison function.
Even if I didn't write nor read a single line of C since 1998, I can still understand most of it.
And this piece of code was seen nowhere into the definition :
if (weight> 0.7) {
/* Adjust for having up to the first 4 characters in common */
j = (minv>= 4) ? 4 : minv;
for (i=0;((i<j)&&(ying_hold[i]==yang_hold[i])&&(NOTNUM(ying_hold[i])));i++);
if (i) weight += i * 0.1 * (1.0 - weight);
[...]
}
return(weight);
I can't neither find any clear paper on this algorithm (is it cursed ?)...
In Wikipedia we have a warning :
Note that Winkler's "reference" C code differs in at least two ways from published accounts of the Jaro-Winkler metric. First is his use of a typo table (adjwt) and also some optional additional tolerance for long strings
The best link I could find was a link from wikipedia about a java class.
They speak about Distance and Proximity terms, but nothing about the boost only when the Jaro comparison is superior to 0.7.
At least, this link is well written and the terms Matches, Transposition and Winkler Modification are explained with text and formulas.
Strange Behaviors
Let's go back to the Oracle functions, especially utl_match.jaro_winkler().
I had some strange behavior with the built-in function.
Algorithm
Even if I couldn't grab a full paper on the algorithm, at least I now know things about it.
Transitivity
According to the algorithm, jaro_winkler(a, b) = jaro_winkler(b, a).
Nothing in the formula show the opposite.
But... :
(SELECT 'THE BARON' AS s1, 'THE RACE SON' AS s2 FROM dual)
SELECT s1, s2,
utl_match.jaro_winkler(s1, s2) AS ora_jw1,
utl_match.jaro_winkler(s2, s1) AS ora_jw2
FROM datas;
S1 S2 ORA_JW1 ORA_JW2
--------- -------------- -------- --------
THE BARON THE RACE SON 0.861111 0.886111
Ok, that's not expected...
Pertinence
Is the built-in function totally accurate ?
Well, if I raised this chapter it's because it's not.
One counter-example is enough to prove the function wrong.
Let's compare "AC DC" with "SACRED SKY" (don't ask...) :
(SELECT 'AC DC' AS s1, 'SACRED SKY' AS s2 FROM dual)
SELECT s1, s2, utl_match.jaro_winkler(s1, s2) AS ora_jw
FROM datas;
S1 S2 ORA_JW
----- ---------- --------
AC DC SACRED SKY 0.733333
Now, let's solve this case with a notebook !
We first have to use the Jaro comparator value using this formula :
The two strings have a length of respectively 5 (str_len1) and 10 (str_len2).
So we have to find the matching characters within a range of floor(greatest(5, 10)/2)-1 = 4.
Let's draw it :
We found four matched characters (#common).
How many transpositions are need to achieve the matching characters of the first string from the matching characters of the second string ?
I've also illustrated this :
We find that two transpositions are needed (the D and the space), so we have enough elements to compute the jaro string comparator value :
Dj(s1,s2) = 0.65
As the first character of the two strings differs, there is no Winkler boost so 0.65 is our Jaro-Winkler similarity score, and it's not the result returned by utl_match.jaro_winkler().
During gathering all this stuff about this algorithm, I saw a dark spot in the definition of the transposition.
Remember : #transpositions are the number of matching characters that are not in the right order.
What defines exactly the right order ? The way I did it previously was by writing matching letters in their original position and counting those who doesn't match.
But what if I write matching letters in their matching position and using their ranking to check if they are ordered.
This would represent as follow :
The first three characters from the second matching string are ordered, only the last one is not ordered (imagine writing them gradually).
Therefore, just one transposition is needed.
Dj(s1,s2) = 0.691666
Still not the utl_match result.
Reading the alias-i paper, I found something very interesting in the transposition definition :
The total number of transpositions is the number of half transpositions divided by two, rounding down.
Dj(s1,s2) = 0.733333
Here we find the same result that utl_match.jaro_winkler().
So, maybe (note the emphasis on this word) that's how the function is working.
Winkler modification
This is just a check to see if utl_match.jaro_winkler() boost only after a jaro string comparator score of 0.7.
Let's use two simple strings :
(SELECT 'ABCDE' AS s1, 'ABFFF' AS s2 FROM dual)
SELECT s1, s2, utl_match.jaro_winkler(s1, s2) AS ora_jw
FROM datas;
S1 S2 ORA_JW
----- ----- ------
ABCDE ABFFF 0.68
Let's first compute Dj(s1, s2), and then boost it :
Dj(s1,s2) = 0.6
Dw(s1,s2) = Dj(s1,s2) + least(2, 4) * 0.1 * ( 1 - Dj(s1,s2) )
Dw(s1,s2) = 0.6 + 2 * 0.1 * ( 1 - 0.6 )
Dw(s1,s2) = 0,68
And it's the same score that the one provided by utl_match.jaro_winkler().
The Winkler modification is indeed applied even if the Jaro score is inferior to 0.7.
Is it an error ? Really, I can't tell.
Performances
Now I want to test about the performance of the function.
Writing the query may matter
To compute a Jaro-Winkler score from a table against another table, you have to cross join them.
Be aware about the cardinality of this join, you really don't want to check 10 millions rows against a 1 millions rows reference table.
Important aspect : The following pattern has been encountered on an Oracle 11.1.0.6 development server, running on W2k3.
I couldn't reproduce on my home Oracle 10g Express running on WinXP.
To illustrate some performance, I'm creating and filling two tables :
CREATE TABLE jrwt_10k (name varchar2(10));
INSERT /*+ APPEND */ INTO jrwt_10
SELECT DBMS_RANDOM.STRING('A', 10) FROM dual
connect BY level <= 1e1;
INSERT /*+ APPEND */ INTO jrwt_10k
SELECT DBMS_RANDOM.STRING('A', 10) FROM dual
connect BY level <= 1e4;
commit;
exec dbms_stats.gather_table_stats (ownname => 'WLDR', tabname => 'JRWT_10' );
exec dbms_stats.gather_table_stats (ownname => 'WLDR', tabname => 'JRWT_10k');
So at most, I'm working with a 100 thousands rows set.
Let's filter on the couples who have a Jaro-Winkler comparison of at least 0.62.
jA.NAME, jB.NAME,
utl_match.jaro_winkler(jA.NAME, jB.NAME) AS JW
FROM
jrwt_10 jA
CROSS JOIN jrwt_10k jB
WHERE
utl_match.jaro_winkler(jA.NAME, jB.NAME)> 0.62
ORDER BY
1 ASC, 3 DESC;
The query answers me in about 10 seconds.
Removing the jaro_winkler function into the WHERE part, thus adding way more rows :
jA.NAME, jB.NAME,
utl_match.jaro_winkler(jA.NAME, jB.NAME) AS JW
FROM
jrwt_10 jA
CROSS JOIN jrwt_10k jB
ORDER BY
1 ASC, 3 DESC;
This time the query took only 5 seconds to produce the result !
I really didn't expect this one to be faster (and twice faster) than the previous one !
I did reproduce the same output using a 10 rows and a 20k, 30k, 70k table.
Times were somewhere between 33% and 50% faster without the where clause.
Without the where clause, the optimizer choose to do a MERGE JOIN CARTESIAN operation.
With the where clause, it choose to do some NESTED LOOPS. The cost is ten times lower, but not the response time.
I didn't find any hint to force the MERGE JOIN CARTESIAN with the where clause, so if anyone has an idea feel free the share it !
Same optimizer path with Oracle 10g XE.
Now I'm trying to dupe Oracle : I want that filter badly, but I also want good response time.
If I can't filter "before", I will filter "after" :
jA.NAME, jB.NAME,
max(utl_match.jaro_winkler(jA.NAME, jB.NAME)) AS JW
FROM
jrwt_10 jA
CROSS JOIN jrwt_100k jB
GROUP BY
jA.NAME, jB.NAME
HAVING
max(utl_match.jaro_winkler(jA.NAME, jB.NAME))> 0.62
ORDER BY
1 ASC, 3 DESC;
This trick is working, I've got my good response time and expected output...
But I'm not very sure WHY I have to write my query like this, and WHY I have so many different results on different computers.
I can't really draw conclusion because it could be a flaw from the hardware / OS / drivers and so on.
Stability
As reported in the previous OTN links, other users experienced ORA-03113 error.
I experienced it to, on my real work queries.
But it's not a query that always crash or always work. It's more built around "sometimes".
I'm no expert at reading trace files, and I didn't fill any metalink request.
I couldn't reproduce those end-of-file communication channel on my 10g XE, so I still have to find a query that won't work most of the time.
Part One Summary
At first I was very excited by this Jaro-Winkler Oracle built-in function.
But I find it too much surrounded by darkness :
- Not really any documentation from Oracle
- No perfect documentation about the algorithm
- Not sure of what is coded
- Performance issues
- Stability issues
That's a long list for something I expected to simplify my life, and there's no way I'm gonna push this in production.
The Part two will deal with other solutions, still around Jaro-Winkler's algorithm.
On a side note, this is by far the longest article I wrote here.
It's hard to find the right words, when I re-read myself I've got the impression to reuse the same adjectives over and over.
It took me several hours to write it down, so I hope you'll find it interesting !
This entry is filed under SQLing. And tagged with algorithm, built-in, distance, Jaro-Winkler, Laurent Schneider, Match, Oracle, string comparator, utl_match. You can follow any responses to this entry through RSS 2.0. You can leave a response, or trackback from your own site.
[...] Jaro-Winkler's Algorithm Part One, Oracle utl_match Built-in … [...]