

Understanding Data Lakes

Thoughts on an increasingly popular way of storing information.
Jun | 12 | 2017


As I did the research for Analytics: The Agile Way, I encountered a relatively new concept in the business and tech landscape: the data lake. In this post and the next, I’ll broach the subject and describe why data lakes matter.

Let’s begin by examining data lakes in contrast to data warehouses. The latter are predicated upon strictly defined schemas, typically of either the star or snowflake variety. That is, they require writing and storing data in a very structured manner or shape. Data warehouses require the strict manipulation of data before it is stored; they do not store data in its “natural state.”
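To make the on-write idea concrete, here is a minimal sketch in Python using the standard library's sqlite3 module as a stand-in for a warehouse. The table names (dim_product, fact_sales), the columns, and the load_sale helper are all hypothetical; I made them up purely for illustration. The point is only that the shape is declared first and anything that doesn't fit is turned away at write time.

```python
import sqlite3

# Hypothetical star-schema fragment: one fact table keyed to one dimension table.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute(
    "CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        amount     REAL NOT NULL CHECK (amount >= 0)
    )
    """
)
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
conn.commit()


def load_sale(row):
    """Try to write one (product_id, amount) row; return True if it fit the schema."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO fact_sales (product_id, amount) VALUES (?, ?)", row
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = load_sale((1, 19.99))                # fits the declared shape
rejected_unknown_product = load_sale((99, 5.0))  # no such product in the dimension
rejected_missing_amount = load_sale((1, None))   # amount may not be NULL
```

Only the first row makes it in. The other two never reach storage, which is exactly the kind of up-front strictness I mean.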

The tightly controlled process of data warehousing meets certain business needs, most often reporting. Still, it fails to meet others. (More on that in my next post on the subject.)

Enter the Data Lake

I’ve been saying for a while now that traditional data warehouses can’t do it all. To this end, data lakes fulfill a genuine business need and software vendors have taken notice.

Yes, at a high level, both data warehouses and lakes store data, but there’s a key difference in when the schema is applied: on-write vs. on-read.

Foolish is the soul who believes that there’s no difference between on-write and on-read.

Let me explain.

Data lakes still require a schema, but that schema isn’t predefined. It’s ad hoc or, if you like, on-read. A schema is applied to the data as it is pulled out of a stored location, not as it goes in. Put differently, data remains in its unaltered (read: natural) state. Critically, a data lake doesn’t define requirements unless and until users query the data. As Margaret Rouse writes:

Each data element in a lake inherits a unique identifier tagged with an extended set of metadata tags. When a business question arises, users can query the data lake for relevant data. The end goal: that those users can analyze that smaller dataset to help answer the question.

Think about it. When used correctly, data lakes allow business and technical users to query smaller, more relevant, and more flexible datasets. As a result, query times can drop to a fraction of what they would have been in a datamart, data warehouse, or relational database.
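Here is a rough sketch of the on-read idea in Python. To be clear, this is not how any particular vendor's product works. The in-memory list standing in for the lake, the ingest and read_with_schema helpers, and the sample records are all assumptions of mine for illustration. Raw payloads go in untouched with a unique ID and a few metadata tags; a schema shows up only when someone asks a question.

```python
import json
import uuid

lake = []  # stand-in for cheap object storage; payloads are kept exactly as received


def ingest(raw, tags):
    """Store the raw payload as-is, with a unique ID and metadata tags."""
    entry = {"id": str(uuid.uuid4()), "tags": set(tags), "raw": raw}
    lake.append(entry)
    return entry["id"]


def read_with_schema(tag, schema):
    """Select entries by tag, then shape each one with the caller's schema."""
    rows = []
    for entry in lake:
        if tag not in entry["tags"]:
            continue  # irrelevant to this question; never parsed
        record = json.loads(entry["raw"])
        # The schema (field name -> type) is applied here, at read time.
        rows.append(
            {field: cast(record[field]) for field, cast in schema.items() if field in record}
        )
    return rows


# Nothing is validated or reshaped on the way in.
ingest('{"sku": "A1", "amount": "19.99", "note": "gift"}', ["sales", "2017"])
ingest('{"user": "kim", "page": "/pricing"}', ["clickstream"])
ingest('{"sku": "B2", "amount": "5"}', ["sales"])

# A business question arrives, and with it a schema.
sales_schema = {"sku": str, "amount": float}
sales = read_with_schema("sales", sales_schema)
```

The clickstream record sits in the lake untouched, and the query works with only the two sales records, shaped the way this particular question needs them. A different question tomorrow can bring a different schema to the same raw data.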

Simon Says

I see a bright future for data lakes. Data volumes continue to increase—especially of the unstructured variety. Data storage costs keep plummeting and data is increasingly valuable. Rather than trying to retrofit useful and mature technologies to a very new environment, expect intelligent organizations to experiment with and adopt data lakes over the next few years.

In my next post on this subject, I’ll describe a few specific ways that companies can use data lakes.

IBM paid me to write this post, but the opinions in it are mine.
