The Case for Data Lakes

Data is more complicated than ever. One-stop shopping sounds pretty good.

 Jun | 19 | 2017




In my last post, I briefly defined a data lake and described how it differs from a traditional data warehouse. Today I’ll make the case for using one and offer a few words of caution before getting started.

Before I do, though, I’d like to take a little trip down memory lane. In my pre-author and -professor days, I frequently wrote complex reports from enterprise systems. At a high level, I can say three things about that data. First, much of the time, it was incomplete, duplicated, or flat-out wrong. Second, it was everywhere. The data typically lay in a number of different places: relational databases, legacy systems, business-intelligence applications, Microsoft Access databases, Excel spreadsheets, etc.

This brings me to the third commonality. Despite the wide variety of places that housed the data and the myriad issues I encountered, one thing remained relatively constant: it was almost always structured. That is, it typically played nicely with calculations (counts, summaries, averages, minimums, maximums, etc.), spreadsheets, and the like. My most effective weapons consisted of SQL statements, stored procedures, and database views. Collectively, they were invaluable: they helped me make sense of some massive structured datasets.
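To give a flavor of what "playing nicely" looked like, here is a minimal sketch using Python's built-in sqlite3 module. The invoices table, its columns, and its three rows are hypothetical and invented purely for illustration; they are not from any system I actually worked on.

```python
import sqlite3

# Hypothetical table and values, made up for this illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO invoices (region, amount) VALUES (?, ?)",
    [("East", 100.0), ("East", 300.0), ("West", 50.0)],
)

# One SQL statement yields counts, averages, minimums, and maximums per region.
summary = conn.execute(
    """
    SELECT region, COUNT(*), AVG(amount), MIN(amount), MAX(amount)
    FROM invoices
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

for row in summary:
    print(row)
```

Because every row has the same typed columns, a single query can summarize the whole table. A tweet, a photo, or a video offers no such guarantee.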

We no longer live in that simple world.

Unstructured Data Is Eating the World

Most of today’s data is of the unstructured variety—perhaps as much as 85 percent. Think tweets, blog posts, articles, photos, videos, etc. (E-mails are semi-structured.) As such, those same powerful tools for structured data no longer cut the mustard.


Imagine needing to deploy vastly different tools on different data sources and types. Now imagine trying to stitch them all together. By way of analogy, what if you could only find non-fiction books at your local library and you had to go to a separate location to borrow 1984 or Dune?

As Jacqueline Lee writes, “as [organizations] seek value in unstructured data, some are amassing data in giant data lakes hoping that it will someday yield insight

Simon Says: Remember the following before you get started.

The benefits of data lakes are hard to overstate, but a few words of caution are in order before you dive in. First, tossing structured data into a data lake causes it to lose its structure and, by extension, some of its value. Second, one big lake doesn’t ameliorate data-quality issues such as duplicate or inaccurate records. A data lake doesn’t magically cleanse dirty data. Garbage in, garbage out (GIGO) still applies.

Finally, for years, data warehouses have been able to handle incremental extract, transform, and load (ETL) processes. Loading a data lake, however, tends to be binary (read: all or nothing). Fixing a mistake may require a complete reload. That, in turn, may increase the time required to derive insights.
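The difference between the two loading styles can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular warehouse or lake product works: the Record class, the updated_at version number, the watermark, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: int
    value: str
    updated_at: int  # hypothetical version number; higher means newer


def incremental_load(target, source, watermark):
    """Warehouse-style load: apply only records newer than the watermark."""
    touched = 0
    for rec in source:
        if rec.updated_at > watermark:
            target[rec.key] = rec
            touched += 1
            watermark = max(watermark, rec.updated_at)
    return watermark, touched


def full_reload(source):
    """All-or-nothing load: rebuild the target from scratch."""
    target = {rec.key: rec for rec in source}
    return target, len(target)


source = [Record(1, "a", 1), Record(2, "b", 2), Record(3, "c", 3)]
target, initial_rows = full_reload(source)
watermark = 3

# One record is corrected at the source.
source[1] = Record(2, "B", 4)

watermark, incremental_rows = incremental_load(target, source, watermark)
_, reload_rows = full_reload(source)

print(incremental_rows, reload_rows)
```

In this toy case, the incremental path touches only the corrected record, while the reload touches all of them. With three rows the gap is trivial; at real volumes it may not be.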

IBM paid me to write this post, but the opinions in it are mine.

PHIL SIMON
