In my last post, I briefly defined a data lake and described how it differs from a traditional data warehouse. Today I’ll make the case for using one and offer a few words of caution before getting started.
Before I do, though, I’d like to take a little trip down memory lane. In my pre-author and -professor days, I frequently wrote complex reports from enterprise systems. At a high level, I can say three things about that data. First, much of the time, it was incomplete, duplicated, or flat-out wrong. Second, it was everywhere. The data typically lay in a number of different places: relational databases, legacy systems, business-intelligence applications, Microsoft Access databases, Excel spreadsheets, etc.
This brings me to the third commonality. Despite the wide variety of places that housed the data and the myriad issues I encountered, one thing remained relatively constant: it was almost always structured. That is, it typically played nicely with calculations (counts, summaries, averages, minimums, maximums, etc.), spreadsheets, and the like. My most effective weapons consisted of SQL statements, stored procedures, and database views. Collectively, they were invaluable: they helped me make sense of some massive structured datasets.
We no longer live in that simple world.
Unstructured Data Is Eating the World
Most of today’s data is of the unstructured variety—perhaps as much as 85 percent. Think tweets, blog posts, articles, photos, videos, etc. (E-mails are semi-structured.) As such, those same powerful tools for structured data no longer cut the mustard.
Imagine needing to deploy vastly different tools on different data sources and types. Now imagine trying to stitch them all together. By way of analogy, what if you could only find non-fiction books at your local library and you had to go to a separate location to borrow 1984 or Dune?
As Jacqueline Lee writes, “as [organizations] seek value in unstructured data, some are amassing data in giant data lakes hoping that it will someday yield insight.”
Simon Says: Remember the following before you get started.
The benefits of data lakes are hard to overstate, but before continuing, a few words of caution are in order. First, tossing structured data into a data lake causes it to lose its structure and, by extension, some of its value. Second, one big lake doesn’t ameliorate data-quality issues such as duplicate or inaccurate records. A data lake doesn’t magically cleanse dirty data. GIGO still applies.
Finally, for years, data warehouses have been able to handle incremental extract, transform, and load (ETL) processes. Loading a data lake, however, tends to be binary (read: all or nothing). Fixing a mistake may require a complete reload, which in turn may increase the time required to derive insights.
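For readers who think better in code, here is a minimal Python sketch of that last contrast. It is an illustration under stated assumptions, not a description of how any particular warehouse or lake product works: the names `incremental_load` and `full_reload`, the `updated_at` field, and the sample rows are all hypothetical.

```python
from datetime import datetime


def incremental_load(target, source_rows, watermark):
    """Upsert only rows changed after `watermark` (hypothetical helper).

    Returns the new watermark and the number of rows touched.
    """
    loaded = 0
    newest = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = row  # insert or overwrite by key
            loaded += 1
            newest = max(newest, row["updated_at"])
    return newest, loaded


def full_reload(source_rows):
    """Rebuild the whole target from scratch (the all-or-nothing approach)."""
    return {row["id"]: row for row in source_rows}


# Made-up sample data; every value here is illustrative only.
source = [
    {"id": 1, "amount": 100, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "amount": 250, "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "amount": 75, "updated_at": datetime(2024, 1, 3)},
]

warehouse = {}
watermark, first_count = incremental_load(warehouse, source, datetime.min)

# A mistake in row 2 is corrected at the source.
source[1] = {"id": 2, "amount": 205, "updated_at": datetime(2024, 1, 4)}

# Incremental: only the corrected row is reprocessed.
watermark, fix_count = incremental_load(warehouse, source, watermark)

# All or nothing: every row is reprocessed to pick up one fix.
reloaded = full_reload(source)
```

Both approaches end up with the same data. The difference is the work involved: the incremental pass touches one row, while the reload touches all of them, and that gap widens as the dataset grows.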