Improving data quality is key

Whether you are in retail/e-commerce, telco, oil and gas, manufacturing or financial services, harnessing value through big data is essential, especially for large, old corporations that are just waiting to be disrupted if they don’t move fast enough. The good thing is, many of them are aware of the risk of being disrupted. Look at the annual reports of publicly listed companies and you will find the words digital/digitalization/digitization used repeatedly. So they know, and they include it somewhere in their mid-to-long-term strategy, although it’s still not a top priority or at the heart of the CEO/senior management/board agenda. But that’s another story for another day.

Before they can start reaping the value/benefits of big data, they have to deal with the main problem, which lies in the data quality itself. This includes the following:

  1. They don’t have the right data
  2. Data is everywhere: not organized, not centralized, and scattered across multiple spreadsheets/documents in different systems in various divisions
  3. No common identifier. In the case of customer data, for example, you can end up with 3 different profiles for the same person! How are you going to understand/profile your customers better and customize products/services for them accordingly?
  4. Worst of all, data is still in hard copy
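
To make point 3 a bit more concrete, here is a minimal sketch in Python of how a data engineer might collapse several profiles of the same customer into one. Everything in it is an assumption for illustration: the source names (crm, billing, support), the field names and the choice of a normalized email address as the common identifier are all made up, and real customer matching is usually much messier than this.

```python
def normalize_email(email):
    """Trim whitespace and lowercase so the same address always compares equal."""
    return email.strip().lower()


def merge_profiles(records):
    """Group records by normalized email and keep the first non-empty value per field.

    The email address as common identifier is an assumption for this sketch;
    in practice it could be a customer ID, phone number or a combination.
    """
    merged = {}
    for record in records:
        key = normalize_email(record["email"])
        profile = merged.setdefault(key, {"email": key, "sources": []})
        profile["sources"].append(record["source"])
        for field, value in record.items():
            if field in ("email", "source"):
                continue
            # Only fill a field that is still missing or empty in the merged profile.
            if value and not profile.get(field):
                profile[field] = value
    return list(merged.values())


# Hypothetical records for the same person living in three different systems.
records = [
    {"source": "crm", "email": "Jane.Doe@example.com ", "name": "Jane Doe", "phone": ""},
    {"source": "billing", "email": "jane.doe@example.com", "name": "", "phone": "555-0100"},
    {"source": "support", "email": "JANE.DOE@EXAMPLE.COM", "name": "J. Doe", "phone": ""},
    {"source": "crm", "email": "sam@example.com", "name": "Sam Lee", "phone": ""},
]

profiles = merge_profiles(records)
for profile in profiles:
    print(profile)
```

With these made-up records, the three Jane Doe entries end up as a single profile that remembers which systems it came from. The "first non-empty value wins" rule is just the simplest possible choice; which source to trust for which field is a decision you will have to make for your own data.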

Because of the lack of data quality, many organizations give up or take longer to get insights from their data, simply because they have to do a lot of cleaning up and organizing before the data makes enough sense to be explored further. This is data engineering work, and it is where you will need data engineers more than data scientists/data analysts.

I stumbled upon a tweet that describes this issue perfectly, albeit in code form:


Predicting and analyzing is the easy part, especially now that there are lots of tools/software available, such as Tableau and Alteryx, which I’ve used before.

So first things first: improve your data quality. It’s a painful exercise and it may take longer than you think, but trust me, if you ignore this problem, you can forget about machine learning and AI. I’ve experienced this myself, both in the Analytics Lab project I took on when I was at MIT and in the data analytics project I’m currently working on (which is still in proof-of-concept mode). The latter made me realize that our companies have a long way to go, having seen the state of their data.

You can approach this issue in 2 ways: (1) understand the data requirements across your organization, check what data you currently have and then develop a data roadmap; or (2) create pilot/proof-of-concept projects, peel back the data issues you face one by one and solve them concurrently on a piecemeal basis. I would recommend (2), based on my belief in “learning by doing”: sometimes you won’t know what data you need until you start working on it. Approach (2) is faster, and as you do more projects you learn more, fix some of the issues and refine things further. With approach (1), by contrast, it may take at least 1 year to understand all the requirements, especially for large companies. Plus, you will never get it right/perfect the first time.

So do take this seriously. Here’s an insightful article on how the machine learning race is really a data race. Have a read!
