Monday, June 15, 2009

Data Quality!

Why is Data Quality Important?
Well, DUH! The whole exercise of an informatics initiative is to be able to make intelligent decisions, faster. You wouldn’t buy a $300,000 house with less than $50,000 in annual income now, would you? Oh sorry, that was a bad example, but you get the idea. So how do you make sure the data behind your decisions is data you can depend on?

Data Profiling
In my previous posts, I mentioned programmatic cleansing of data and the use of “discipline”. This is where the discipline part of your informatics initiative comes in. Understand your source! Garbage in, garbage out, folks! A lot of times, the mantra of “speed to market” trumps this activity called profiling, on the assumption that data quality is fine, usually sworn on his/her firstborn by the Director who handles the operational system you are sourcing your data from. So you get your data on time and make a decision based on what you see, only to find out later that the data lied to you. Naturally, your fury is directed at the Director who swore that the data was fine, and who just got fired because this whole fiasco cost your company a few million bucks.

What went wrong? Was the Director wrong? Did he/she lie to you? The answer is NO. They didn’t lie to you. The fundamental difference here is the way the data is being used. The operational systems person is not looking at 10 years’ worth of data to identify trends. They are not using it to build predictive models that take you where your vision is heading. To them, the data quality was excellent for day-to-day operations. You, on the other hand, needed at least 5 years’ worth of data to come up with a trend. While it was available in the operational system, nobody profiled this data, and so you missed the fact that a key measure has had null values in it for the past 4 years, flat-lining your trend analysis graph. Voila, the business case for data profiling has been made!
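For the geeks in the audience, here is roughly what that missed profiling step could look like. This is a minimal sketch in plain Python, not a real profiling tool: the function name `null_rate_by_year`, the `year` and `revenue` fields, and the sample rows are all made up for illustration. It simply reports, year by year, what fraction of a measure is null.

```python
from collections import defaultdict


def null_rate_by_year(rows, measure, year_field="year"):
    """Return {year: fraction of rows in that year where `measure` is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        year = row[year_field]
        totals[year] += 1
        if row.get(measure) is None:
            missing[year] += 1
    return {year: missing[year] / totals[year] for year in sorted(totals)}


# Hypothetical sample data: the measure goes null from 2005 onward.
rows = [
    {"year": 2004, "revenue": 120.0},
    {"year": 2005, "revenue": None},
    {"year": 2005, "revenue": None},
    {"year": 2006, "revenue": None},
]

profile = null_rate_by_year(rows, "revenue")
for year, rate in profile.items():
    # The 50% threshold is an arbitrary choice for this sketch.
    flag = "  <-- investigate" if rate > 0.5 else ""
    print(f"{year}: {rate:.0%} null{flag}")
```

Run something like this against the source before you build the trend graph, and the years of nulls show up in minutes instead of after the decision has been made.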

Data Validation
Here is where your ever-so-faithful IT folk can do some nifty programming to get you the quality you have always deserved. A lot of these issues can be taken care of up front. Element-level validations can be put in at the data acquisition level; more complex validations can be put in at the integration level. This approach ensures a couple of things. One is that less garbage gets into your warehouse, and the other is that processing is not heavily impacted, as you are splitting the activity across two different points of your data flow (a deeper, more technical dive on this later, as my inner-geek is screaming “Stop! Do not say that without proper clarification! You are about to get roasted over a coal fire!”).
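To make the two-stage idea a little more concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the order records, the field names (`order_id`, `customer_id`, `amount`), and the specific rules are hypothetical, and a real warehouse would do this in its ETL tooling. The point is only the split: simple per-field checks at acquisition, and checks that need more than one record or source at integration.

```python
def validate_element(record):
    """Acquisition-level checks: each field is judged on its own."""
    errors = []
    if not record.get("order_id"):
        errors.append("order_id is missing")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    return errors


def validate_integration(orders, known_customer_ids):
    """Integration-level checks: rules that span records or sources."""
    errors = []
    seen = set()
    for order in orders:
        oid = order["order_id"]
        if oid in seen:
            errors.append(f"duplicate order_id {oid}")
        seen.add(oid)
        if order.get("customer_id") not in known_customer_ids:
            errors.append(f"order {oid} references unknown customer")
    return errors


# Hypothetical incoming feed.
incoming = [
    {"order_id": "A1", "customer_id": "C1", "amount": 50.0},
    {"order_id": "", "customer_id": "C1", "amount": 10.0},
    {"order_id": "A2", "customer_id": "C9", "amount": 20.0},
    {"order_id": "A3", "customer_id": "C1", "amount": -5},
]

# Stage 1: reject records that fail the cheap element-level checks.
clean = [r for r in incoming if not validate_element(r)]

# Stage 2: run the heavier cross-record checks only on what survived.
integration_errors = validate_integration(clean, known_customer_ids={"C1"})
print(integration_errors)
```

Because the cheap checks run first, the heavier integration checks only ever see records that already passed stage one, which is the load-splitting described above.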
The point of all this is, quality of data is very important in an informatics initiative. In fact, that is the most important thing. So, when you hear someone saying, “data quality is my number one issue”, you can safely assume that your informatics initiative is doomed.