Understanding Data Lakes

Thoughts on an increasingly popular way of storing information.

Introduction

As I did the research for Analytics: The Agile Way, I encountered a relatively new concept in the business and tech landscape: the data lake. In this post and the next, I’ll broach the subject and describe why data lakes matter.

Let’s begin by examining data lakes in contrast to data warehouses. The latter are predicated upon strictly defined schemas, typically of either the star or snowflake variety. That is, they require writing and storing data in a very structured manner or shape. Data warehouses transform data to fit that shape before storing it; they do not store data in its “natural state.”
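To make the warehouse side of the contrast concrete, here is a minimal sketch of schema-on-write using Python's built-in sqlite3 module. The sales_fact table, its columns, and the write_sale helper are all hypothetical names I made up for illustration; a real warehouse would be far more elaborate.

```python
import sqlite3

# Hypothetical fact table: the shape is fixed before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales_fact (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL,
        amount     REAL    NOT NULL
    )
    """
)


def write_sale(record):
    """Store a record only if it fits the predefined shape.

    Fields the table does not know about are silently dropped, and a
    record missing a required field is rejected at write time.
    """
    try:
        conn.execute(
            "INSERT INTO sales_fact (sale_id, product_id, amount) VALUES (?, ?, ?)",
            (record.get("sale_id"), record.get("product_id"), record.get("amount")),
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = write_sale({"sale_id": 1, "product_id": 10, "amount": 9.99, "note": "gift"})
rejected = write_sale({"sale_id": 2, "product_id": 11})  # no amount
stored_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

Note what happens to the "note" field on the first record: it never makes it into storage. That is the trade-off of deciding the shape up front.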

The tightly controlled process of data warehousing meets certain business needs well, most often reporting. Still, it fails to meet others. (More on that in my next post on the subject.)

Enter the Data Lake

I’ve been saying for a while now that traditional data warehouses can’t do it all. To this end, data lakes fulfill a genuine business need, and software vendors have taken notice.

Yes, at a high level, both data warehouses and lakes store data, but there’s a key difference: schema on-write vs. schema on-read.

Foolish is the soul who believes that there’s no difference between on-write and on-read.

Let me explain.

Data lakes still require a schema, but that schema isn’t predefined. It’s ad hoc or, if you like, on-read. A plan or schema is applied to the data as it is pulled out of a stored location, not as it goes in. Put differently, data remains in its unaltered (read: natural) state. Critically, a data lake doesn’t define requirements unless and until users query the data. As Margaret Rouse writes:

Each data element in a lake inherits a unique identifier tagged with an extended set of metadata tags. When a business question arises, users can query the data lake for relevant data. The end goal: that those users can analyze that smaller dataset to help answer the question.

Think about it. When used correctly, data lakes allow business and technical users to query smaller, more relevant, and more flexible datasets. As a result, query times can drop to a fraction of what they would have been in a data mart, data warehouse, or relational database.
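For the on-read side, here is a toy sketch in Python. It is not how any particular data lake product works; the lake list, the tag names, and the query function are all assumptions of mine, meant only to show raw data staying untouched while a schema is applied at query time.

```python
import json

# A toy "lake": raw records kept exactly as they arrived, each with an
# identifier and metadata tags. Every name here is hypothetical.
lake = [
    {"id": "a1", "tags": {"source": "web", "type": "order"},
     "raw": '{"customer": "Ann", "total": "19.50", "coupon": "SPRING"}'},
    {"id": "b2", "tags": {"source": "pos", "type": "order"},
     "raw": '{"customer": "Bob", "total": "7.25"}'},
    {"id": "c3", "tags": {"source": "web", "type": "clickstream"},
     "raw": '{"page": "/home", "ms": 412}'},
]


def query(elements, tags, schema):
    """Pick elements by metadata tag, then apply a schema as they are read.

    `schema` maps each wanted field to a function that casts its value.
    Fields missing from a raw record come back as None.
    """
    rows = []
    for element in elements:
        if all(element["tags"].get(key) == value for key, value in tags.items()):
            record = json.loads(element["raw"])
            rows.append({
                field: cast(record[field]) if field in record else None
                for field, cast in schema.items()
            })
    return rows


# The business question decides the shape, not the storage layer.
orders = query(lake, {"type": "order"}, {"customer": str, "total": float})
```

The same stored records could answer a different question tomorrow with a different schema, say one that asks for the coupon field. Nothing in the lake has to change for that to work.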

Simon Says

I see a bright future for data lakes. Data volumes continue to increase—especially of the unstructured variety. Data storage costs keep plummeting and data is increasingly valuable. Rather than trying to retrofit useful and mature technologies to a very new environment, expect intelligent organizations to experiment with and adopt data lakes over the next few years.

Feedback

What say you?


In my next post on this subject, I’ll describe a few specific ways that companies can use data lakes.

This post was brought to you by IBM Global Technology Services. For more content like this, visit IT Biz Advisor.
