Refine Your Data the Google Way

Cleaning up data has never been easier.

For all of the talk about Big Data, many hidebound organizations struggle with basic blocking and tackling. That is, their Small Data is a mess. Their internal structured information on customers, products, and employees contains many errors, duplicates, inconsistencies, and omissions. For instance, I’ve seen many customer lists with “similar but not equal” values like these:

[table id=7 /]

Forget the fact that the descriptions on the table above are the same. The value on the left is the one that truly matters for reporting purposes. Relational databases are predicated upon precision, not the extent to which data is “similar.” (Other matching technologies aren’t quite so demanding.) To use a fancy term, the data in the table above lacks referential integrity. And this is an enormous problem that inhibits many basic analyses, much less remotely accurate predictions.
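To make the point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, customer names, and amounts are hypothetical and not taken from the table above. It shows how an exact-match join silently drops rows whose customer name is merely "similar."

```python
import sqlite3

# Hypothetical data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE orders (customer_name TEXT, amount REAL)")
conn.execute("INSERT INTO customers VALUES ('Acme Corp.', 'East')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("Acme Corp.", 100.0),       # exact match
        ("ACME Corp", 250.0),        # different case, no period
        ("Acme Corporation", 75.0),  # spelled out
    ],
)

# The join compares names for strict equality, so only one order survives.
joined_total = conn.execute(
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN customers c ON o.customer_name = c.name"
).fetchone()[0]

# The true total across all three spellings of the same customer.
all_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(joined_total, all_total)
```

In this sketch, the report built on the join understates the customer's orders because two of the three rows never match.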

What to do?

There’s no simple, ten-minute solution to tackling data-quality issues on an enterprise scale. As someone who prides himself on holistic thinking, I believe that quick fixes are unlikely to permanently solve institutional problems like these. Many organizational cultures in effect ignore data issues, whether or not they appoint a chief data officer.

Fixing a culture isn’t a short-term endeavor, so it’s only natural for one to ask, What can technology do to help?

Funny you should ask. A company can purchase and deploy expensive enterprise applications designed to purify data and maintain essential master lists. Many organizations have successfully improved core data issues with MDM tools. But what about smaller organizations lacking large budgets?

A Googly (Temporary) Solution?

One neat new tool is Google Refine. By no means as powerful as some of the best-of-breed MDM solutions, Refine allows users to easily clean up suspect records with a few clicks of the mouse. See for yourself:

Now, Google Refine is not an enterprise-grade data management solution. It does not integrate out of the box with existing systems and infrastructure like a proper MDM application. The latter offers more benefits, albeit at significantly higher cost. Still, Refine can assist individual organizations in short-term clean-up projects. What’s more, it can easily display the depth of a company’s data issues, similar in a way to basic SQL statements and Excel Pivot Tables.
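For a rough sense of how near-duplicates can be grouped, here is a small sketch. It is not Refine's actual code. It assumes a simple matching key built by lowercasing, stripping punctuation, and sorting the unique words. The function names (`fingerprint`, `cluster_variants`) and the sample names are hypothetical.

```python
import re
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a crude matching key: lowercase, drop punctuation, sort unique words."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster_variants(values):
    """Group values by key and keep only keys with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return {key: vals for key, vals in groups.items() if len(set(vals)) > 1}


# Hypothetical customer names for illustration only.
names = ["Acme Corp.", "ACME Corp", "Corp., Acme", "Globex Inc"]
clusters = cluster_variants(names)

for key, variants in clusters.items():
    print(f"{key}: {len(variants)} variants -> {variants}")
```

A key this simple will miss abbreviations and misspellings, so treat the result as a list of candidates for a person to review rather than an automatic merge.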

Simon Says: Use Google Refine for quick hits.

The noise around Big Data is much louder than its signal. While figuring out what to do and how to do it, remember this: Manage your structured data well and you’ll find that you’ll get more out of the unstructured stuff. You’ll get more from Big Data if you treat its little brother well.

Feedback

What say you?


I wrote this post as part of the IBM for Midsize Business program. It provides midsize businesses with the tools, expertise, and solutions they need to become engines of a smarter planet. IBM has paid me to contribute to this program, but this post doesn’t express IBM’s positions, strategies, or opinions.
