Track Source At Column Level Inside Your Datawarehouse

Add a comment

Every one pretty much knows that code have to be commented.
In C, C++, Java, whatever your favorite language is.

In most ETL dataflow I've encountered, I've always had to reverse everything to know where the data came from.

Sometimes I can find comments in the code, but often I can't.

When working on feeding a datawarehouse (not speaking about datamarts here), the data are matched and cleansed, but except those crucial steps, most of the columns are going down straightforward.

Nevertheless, the same question remains : WHERE my data came from, WHAT is feeding them ?

So where do you comment your code ?
You can mostly comment inside :

  • documentation
  • the ETL process
  • the database





Documentation



This is most used way of tracking a project I've met.
The problems with commenting inside a documentation are multiples.

First you have to find where is the documentation and if you're reading the last version. Granted, with technologies like MS sharepoint it is easier to resolve those steps.

Then, you can spend bunch of time on the presentation task with sheets, boards and schema and making all of this looking nice and clear.

More than this, the documentation can grow very large. Imagine a fifty table projects, where you describe every computation from every column. This can lead to two or three hundreds pages.
It's not a problem of his own, but if a documentation doesn't look understandable, maybe one people on two will overlook it.

And the documentation still have to be up-to-date : at the start of a project, you have time and money and you do the documentation well according to the specifications.

At the end of the project, you're late, you're pushed into crazy coding to achieve your whole new over-complicated ETL process identified the day before, and of course your working on a brand new project.

After five year of the project life, when ten different people have touched everything, with different skills and understanding levels, I can ensure you that the documentation is outdated.

I've almost stopped to use those "external" documentations except to describe the meta-flow at database / files level.

My target for this (slowly changing) documentation is exploitation and I have to stay quite high-level about what I'm talking about.


ETL process



Coding inside your ETL tool have advantages, you're explaining to the next developer what you're doing. If it's simple, there is no real need to comment everything. When you're doing things a bit tricky, it's really worth it.

Another good thing is that you can follow what's going inside your DWH on the column level by ready the metadatas.

But still, all of us aren't ETL developers and don't know how to navigate inside a tool to find wanted information.

I use comments inside ETL process when I'm doing sophisticated things, either related to complex query or advanced use of the product.
My target about the comments are other ETL developers so I don't mind using technical vocabulary.


Database



I've done this on the current DWH I'm working on.

I'm using the comments inside the database (table and column comments) to track what process feed the table, what are my source column, and the regular comment of the column, taken from the source system.

If source are multiple, I'm using a text separator to create a list : I'm not use this list to achieve complex query, I just want it to be readable.

Here's a script of a DWH language table from another database :

COMMENT ON TABLE  DWH.LANGUAGE               IS 'DWH_DB1.M_ALI_DWH_LANGUAGE ; Maps language key numbers to their names';

COMMENT ON COLUMN DWH.LANGUAGE.LANGUAGE_KEY  IS 'M_ALI_DWH_LANGUAGE ; [DB1].dbo.LANGUAGE.LANGUAGE_KEY        ; DB1 number for language';
COMMENT ON COLUMN DWH.LANGUAGE.LANGUAGE_NAME IS 'M_ALI_DWH_LANGUAGE ; [DB1].dbo.LANGUAGE.LANGUAGE_NAME       ; DB1 name of language';
COMMENT ON COLUMN DWH.LANGUAGE.FG_IS_SPOKEN  IS 'M_ALI_DWH_LANGUAGE ; [DB1].dbo.LANGUAGE.IS_SPOKEN_LANGUAGE  ; Is spoken language (1=yes, 0=no)';
COMMENT ON COLUMN DWH.LANGUAGE.FG_IS_WRITTEN IS 'M_ALI_DWH_LANGUAGE ; [DB1].dbo.LANGUAGE.IS_WRITTEN_LANGUAGE ; Is written language (1=yes, 0=no)';
COMMENT ON COLUMN DWH.LANGUAGE.ISO_639_2B_CD IS 'M_ALI_DWH_LANGUAGE ; [DB1].dbo.LANGUAGE.MARC21_LANG_CODE    ; MARC21 language code. Almost match with ISO 639-2B code';
COMMENT ON COLUMN DWH.LANGUAGE.MOD_STAMP     IS 'M_ALI_DWH_LANGUAGE ; [DB1].dbo.LANGUAGE.MOD_STAMP           ; Last insert or update date';
COMMENT ON COLUMN DWH.LANGUAGE.MOD_USER      IS 'M_ALI_DWH_LANGUAGE ; [DB1].dbo.LANGUAGE.MOD_USER            ; User of modifcation';
COMMENT ON COLUMN DWH.LANGUAGE.MOD_FLAG      IS 'M_ALI_DWH_LANGUAGE ; [DB1].dbo.LANGUAGE.MOD_FLAG            ; Type of last modification. ''I''=insert, ''U''=update, ''D''=delete';
COMMENT ON COLUMN DWH.LANGUAGE.ID_LOG        IS 'M_ALI_DWH_LANGUAGE ; DWH_LOG.TEC_TRT_LOG ; DWH log identifier';



In the table comment, I keep track of the project name and the name of the entity that load this table (in Genio this is a module, in SSIS that would be a dataflow and in PowerCenter this is called a mapping).
The name being unique inside the project, I can find what feeds my table.

In the column comments, I first repeat the loading entity name (in case of multiple project / entity loading the same table, for now I'm successfully avoiding that), then i keep track of the source database.table.column information, and then there is the classical comments helping about what is inside it.

You've probably spotted the last column to be an id_log, I'm using it to answer the remaining questions about a load : when, how long, did I run into errors, what errors, what was the proportion of update and insert ?
But that's a whole subject on his own !

On more good thing is that your can write at table or column level MANUALLY UPDATE (or any word that will suit), to tell any data managers that they have to keep an eye on this table, and you can do this with a single query.

Even if later one developer adds another column without commenting it, there's a 95% chance that the source table have changed and to still find what is feeding this new column.

That's probably not perfect, but at least every one who use a SQL tool or who know to query the system tables can find very valuable information.

And that's more people than ETL developers : project manager with developing past, reporting developers, database administrators, et cetera.

From a developer point of view, I find the overhead of commenting columns when creating them is very decent and a lot less a pain against filling text documentations.

And it you name your items with the same schema, one little query can fill in 80% of it alone.


Conclusion



Every dataflow has to be documented. But not everyone need the same comprehension of the duty.

I've stopped writing technical documentation the usual way. I'm quite sure my past documentations have been read maybe once or twice, and it's, in my opinion, a huge time sink.

Of course, if the documentation task is fully budgeted and I'm asked to do so, I will proceed.

But lately, I'm working in a more reactive manner (understand : "I need that for yesterday, please code it.") and I don't have the time for the classical and printable documentation.

Using comments while coding is something natural to me, so I'm using this to provide enough information for everyone who can access it.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Google Bookmarks
  • Facebook
  • TwitThis
  • blogmarks
  • email
  • Furl
  • LinkedIn

This entry is filed under Datawarehousing. And tagged with , , , , , , . You can follow any responses to this entry through RSS 2.0. You can leave a response, or trackback from your own site.

  1. No Comments