Understanding Data Lakes

Thoughts on an increasingly popular way of storing information.

 Jun | 12 | 2017




As I did the research for Analytics: The Agile Way, I encountered a relatively new concept in the business and tech landscape: the data lake. In this post and the next, I’ll introduce data lakes and describe why they matter.

Let’s begin by examining data lakes in contrast to data warehouses. The latter are predicated upon a strictly defined schema, typically of either the star or snowflake variety. That is, they require writing and storing data in a predefined structure or shape (schema-on-write). Data must be transformed to fit that structure before it is loaded; data warehouses do not store data in its “natural state.”
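To make the on-write idea concrete, here is a minimal Python sketch. It is not how any particular warehouse product works; the `SALES_SCHEMA` table definition and the `write_row` helper are hypothetical names invented for illustration. The point is only that the shape of the data is checked before anything is stored.

```python
from datetime import date

# Hypothetical fact-table schema (an assumption for illustration):
# column name -> required Python type.
SALES_SCHEMA = {"order_id": int, "sale_date": date, "amount": float}


def write_row(table, record):
    """Append record to table only if it matches SALES_SCHEMA exactly."""
    if set(record) != set(SALES_SCHEMA):
        mismatched = sorted(set(record) ^ set(SALES_SCHEMA))
        raise ValueError(f"unexpected or missing columns: {mismatched}")
    for column, expected_type in SALES_SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    table.append(record)


warehouse_table = []

# A record that already fits the schema is accepted.
write_row(
    warehouse_table,
    {"order_id": 1, "sale_date": date(2017, 6, 12), "amount": 19.99},
)

# A raw record with a free-text field is rejected at write time.
try:
    write_row(warehouse_table, {"order_id": 2, "note": "customer called twice"})
    rejected = False
except ValueError:
    rejected = True
```

In this sketch the second record never reaches storage. Anything that does not fit the predefined shape has to be transformed or dropped before it is loaded.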

The tightly controlled process of data warehousing meets certain business needs well, reporting chief among them. Still, it fails to meet others. (More on that in my next post on the subject.)

Enter the Data Lake

I’ve been saying for a while now that traditional data warehouses can’t do it all. To this end, data lakes fulfill a genuine business need and software vendors have taken notice.

Yes, at a high level, both data warehouses and data lakes store data, but there’s a key difference in when the schema is applied: on-write vs. on-read.

Foolish is the soul who believes that there’s no difference between on-write and on-read.

Let me explain.

Data lakes still require a schema, but that schema isn’t predefined. It’s ad hoc or, if you like, on-read. A plan or schema is applied to the data as it is pulled out of storage, not as it goes in. Put differently, data remains in its unaltered (read: natural) state until then. Critically, a data lake doesn’t define requirements unless and until users query the data. As Margaret Rouse writes:

Each data element in a lake inherits unique identifier tagged with an extended set of metadata tags. When a business question arises, users can query the data lake for relevant data. The end goal: that those users can analyze that smaller dataset to help answer the question.

Think about it. When used correctly, data lakes allow business and technical users to query smaller, more relevant, and more flexible datasets. As a result, query times can drop to a fraction of what they would have been in a data mart, data warehouse, or relational database.
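For readers who think in code, here is a rough Python sketch of the on-read idea, with tagged data elements as described above. It is a toy under stated assumptions, not any vendor’s implementation: the in-memory `lake` list and the `ingest` and `query` helpers are hypothetical names. Raw payloads go in untouched, each with a unique identifier and metadata tags, and a schema is applied only when someone asks a question.

```python
import json
import uuid

# Hypothetical in-memory stand-in for a data lake (an assumption for illustration).
lake = []


def ingest(raw, tags):
    """Store the payload unaltered, with a unique identifier and metadata tags."""
    element_id = str(uuid.uuid4())
    lake.append({"id": element_id, "tags": set(tags), "raw": raw})
    return element_id


def query(tag, schema):
    """Select elements by tag, then apply the schema while reading."""
    rows = []
    for element in lake:
        if tag not in element["tags"]:
            continue
        payload = json.loads(element["raw"])
        # The schema is applied here, at read time: pick fields and cast them.
        rows.append(
            {
                field: cast(payload[field])
                for field, cast in schema.items()
                if field in payload
            }
        )
    return rows


# Nothing is rejected or reshaped on the way in.
ingest('{"order_id": "1", "amount": "19.99", "note": "gift"}', ["sales", "2017"])
ingest('{"user": "ann", "text": "love it"}', ["social"])
ingest('{"order_id": "2", "amount": "5.00"}', ["sales"])

# The business question decides the schema, and only tagged elements are read.
sales = query("sales", {"order_id": int, "amount": float})
```

Here `sales` holds just the two sales records, cast to the types the question needs, while the social media record stays in the lake untouched. A different question could read the same raw data with a different schema.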

Simon Says

I see a bright future for data lakes. Data volumes continue to increase—especially of the unstructured variety. Data storage costs keep plummeting and data is increasingly valuable. Rather than trying to retrofit useful and mature technologies to a very new environment, expect intelligent organizations to experiment with and adopt data lakes over the next few years.

In my next post on this subject, I’ll describe a few specific ways that companies can use data lakes.

IBM paid me to write this post, but the opinions in it are mine.


PHIL SIMON
