Refine Your Data the Google Way

Cleaning up data has never been easier.
Dec | 10 | 2013

For all of the talk about Big Data, many hidebound organizations struggle with basic blocking and tackling. That is, their Small Data is a mess. Their internal structured information on customers, products, and employees contains many errors, duplicates, inconsistencies, and omissions. For instance, I’ve seen many customer lists with “similar but not equal” values like these:

[table id=7 /]

Forget the fact that the descriptions in the table above are the same. The value on the left is the one that truly matters for reporting purposes. Relational databases are predicated upon precision, not the extent to which data is “similar.” (Other matching technologies aren’t quite so demanding.) To use a fancy term, the data in the table above lacks referential integrity. And this is an enormous problem that inhibits many basic analyses, let alone remotely accurate predictions.
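
To make the point concrete, here is a minimal sketch in Python. The customer names, IDs, and order amounts are hypothetical (they are not the values from the table above), and the function name is my own. It mimics what an exact-match lookup does: only the spelling that is character-for-character identical to the master record gets counted.

```python
# Hypothetical master list: one "official" spelling per customer.
customer_ids = {"Acme Inc.": "C001"}

# Hypothetical orders, keyed by whatever spelling someone typed in.
orders = [
    ("Acme Inc.", 100),
    ("ACME, Inc", 250),
    ("Acme Inc", 75),
]


def exact_match_revenue(customer_ids, orders):
    """Sum order amounts whose customer name exactly equals a master-list name."""
    return sum(amount for name, amount in orders if name in customer_ids)


total = exact_match_revenue(customer_ids, orders)
unmatched = [name for name, _ in orders if name not in customer_ids]

print(total)      # 100
print(unmatched)  # ['ACME, Inc', 'Acme Inc']
```

All three orders plainly belong to the same customer, yet the report credits only one of them. The other two quietly fall out of the total.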

What to do?

There’s no simple, ten-minute solution to tackling data-quality issues on an enterprise scale. As someone who prides himself on holistic thinking, I doubt that quick fixes will permanently solve institutional problems like these. Many organizational cultures in effect ignore these issues, whether or not they appoint a chief data officer.

Fixing a culture isn’t a short-term endeavor, so it’s only natural for one to ask, What can technology do to help?

Funny you should ask. A company can purchase and deploy expensive enterprise applications designed to purify data and maintain essential master lists. Many organizations have successfully addressed core data issues with master data management (MDM) tools. But what about smaller organizations lacking large budgets?

A Googly (Temporary) Solution?

One neat new tool is Google Refine. By no means as powerful as some of the best-of-breed MDM solutions, Refine allows users to easily clean up suspect records with a few clicks of the mouse.

Now, Google Refine is not an enterprise-grade data management solution. It does not integrate out of the box with existing systems and infrastructure like a proper MDM application. The latter offers more benefits, albeit at significantly higher cost. Still, Refine can assist individual organizations in short-term clean-up projects. What’s more, it can easily display the depth of a company’s data issues, similar in a way to basic SQL statements and Excel Pivot Tables.
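
For a rough sense of how this kind of clean-up can surface problems, here is a small Python sketch of key-collision clustering: reduce each value to a normalized key, then group the values that share a key. This is my own simplified illustration, not Refine's actual code. The function names and sample values are assumptions.

```python
import re
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Normalize a value: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group distinct values by fingerprint; keep only keys with 2+ spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


# Hypothetical customer names, not taken from any real list.
names = ["Acme Inc.", "ACME, Inc", "Inc Acme", "Globex"]
suspects = cluster(names)
print(suspects)  # {'acme inc': ['ACME, Inc', 'Acme Inc.', 'Inc Acme']}
```

Each cluster is a candidate for merging into one spelling, and the number of clusters gives a quick read on how deep the mess goes. A person still needs to review them, since a shared key does not prove two records are the same customer.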

Simon Says: Use Google Refine for quick hits.

The noise around Big Data is much louder than its signal. While figuring out what to do and how to do it, remember this: Manage your structured data well and you’ll find that you’ll get more out of the unstructured stuff. You’ll get more from Big Data if you treat its little brother well.


What say you?

I wrote this post as part of the IBM for Midsize Business program. IBM has paid me to contribute to this program, but this post doesn’t express IBM’s positions, strategies, or opinions.
