Unraveling the Concepts of a Data Warehouse and a Data Lake: How to Build a Modern Data Architecture to Overcome Modern Data Problems

Written by Admin | May 11, 2021 7:33:15 PM

There is a lot of noise in the data and analytics industry today about the modern data stack and how to build a modern data architecture to support it. Data-driven organizations understand the value in making big data accessible to all business consumers but find that there is too much information—and hence confusion—around how to modernize and where to start.

Technology vendors have catchy slides showing how they hold the keys to modernization, suggesting that if you do not have their product, you are behind the times and behind your competition. But modernization isn’t just a technology play; it should include strategy around people and processes too. It’s about building a modern data architecture so that data flows through your organization and every business user who needs it has access to it when they need it.

What is Data Architecture and Why is It Important?

Generally speaking, data architecture is the plan and design for the entire data lifecycle for an organization, all the way from when data is captured to when value is generated from data through analytics. When we hear the phrase modern data architecture in the marketplace discussion, it tends to be more specific and targeted to a particular solution and particular technologies as opposed to describing the entire discipline of data architecture.

Data architecture is what allows organizations to maximize the value that they create from data with analytics by executing on the vision of a comprehensive data strategy that is connected to business goals and focused on people, process, technology, and data. Without a sound data architecture, organizations are limited in the impact, scope, scale, and breadth of analysis that they’re capable of.

What Makes an Approach to Data Architecture Modern?

Let’s be honest: the term “modern” is vague. What is modern today is not guaranteed to be modern tomorrow, and anyone can tailor the definition of “modern” to fit their purposes. The only guarantee is that there will always be more modern software to buy and another new approach to try out. Before you fully buy in to modern data stack software marketing, it is important to acknowledge that every organization has different business and data problems, and therefore not every organization should adopt the exact same architecture or stack.

Common characteristics of modern data problems include:

Different types of data, including relational structured data, semi-structured data (JSON, XML, etc.), and unstructured data, are making it difficult for organizations to integrate their data, forcing them to adopt new technologies and processes to address the issue.
Ever-increasing volumes of data are making it difficult for organizations to keep pace with managing big data and, in turn, to make the most of its value.
Disparate data is growing more complex as organizations procure more and more operational systems and then require that the data from them be integrated for analytics purposes.
Citizen data analysts, citizen data engineers, and citizen data scientists have made it possible for organizations to grow the data capabilities of their laypeople, but that also increases the risk of people reaching different results and different conclusions from the same data.
Cloud, on-prem, and hybrid options allow organizations to store data and systems where they want, but they require a data strategy to bring it all together securely.
The mix of new and old technology means organizations must find ways to build teams that can support the new technologies and coding languages in addition to supporting the existing systems and processes that have been around for decades.
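
To make the first problem in the list concrete, here is a minimal sketch of what integrating mixed data types can involve: one semi-structured JSON record and one XML record are flattened into the same relational-style row. The payloads and field names (`order_id`, `customer_name`, `total`) are invented for illustration and do not come from any particular system.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample payloads; the field names are assumptions for illustration.
JSON_ORDER = '{"order_id": 1, "customer": {"name": "Acme"}, "total": 120.5}'
XML_ORDER = (
    "<order id='2'><customer><name>Globex</name></customer>"
    "<total>80.0</total></order>"
)


def row_from_json(payload: str) -> dict:
    """Flatten a nested JSON order into a flat, relational-style row."""
    doc = json.loads(payload)
    return {
        "order_id": int(doc["order_id"]),
        "customer_name": doc["customer"]["name"],
        "total": float(doc["total"]),
    }


def row_from_xml(payload: str) -> dict:
    """Flatten an XML order into the same row shape as the JSON version."""
    root = ET.fromstring(payload)
    return {
        "order_id": int(root.attrib["id"]),
        "customer_name": root.findtext("customer/name"),
        "total": float(root.findtext("total")),
    }


# Both sources now share one schema and can be loaded into the same table.
rows = [row_from_json(JSON_ORDER), row_from_xml(XML_ORDER)]
```

Even in this toy case, each source needs its own parsing logic before the data lines up, which is the integration cost the list describes.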

Data Lake vs Data Warehouse: What is the Value of a Data Warehouse and Data Lake in your Data Architecture?

In my experience, two key concepts in modern data architecture help address the modern data problems above, and both are grossly misunderstood: the data warehouse and the data lake. Understanding these concepts and how to apply them is one of the keys to making your data estate capable of overcoming your modern data problems.

What is a Data Warehouse?

The data warehouse is no longer cool. Some are claiming it is dead (though some people also say dashboards are dead, dimensional modeling is dead, and so on). There are many reasons for this, but I think the main reason is that it seems like everyone has tried to build a data warehouse, and it did not go well. When I see the phrase “enterprise data warehouse” at a client site, my confidence that it is a business-accepted, performant, and useful component for data across the enterprise is basically zero. There is a reason why popular cloud technology moved away from branding itself as a cloud data warehouse and toward branding itself as a cloud data platform. The term data warehouse comes with all sorts of baggage.

However, you should be willing to reset your perceptions of what a data warehouse is, because a fit-for-purpose data warehouse is mission-critical to any organization that is doing analytics broadly and at scale.

A data warehouse is a highly governed data estate where raw data is translated into reliable, consistent, and quality-rich information. A data warehouse:

Is accepted by the business as the source of truth.
Is modeled and architected after business processes and not a particular source system.
Is rocket-fast and allows a savvy data-literate person to span disparate business processes seamlessly with basic knowledge of a database query language.
Has denormalized data for ease of use and performance improvement and is updated frequently enough to answer important business questions in a timely manner.
Allows you to get down to the lowest possible detail of each business process to understand why things are the way they are.

If all of the above isn’t true about the thing you’re calling a data warehouse, it is time to revisit your data warehouse strategy.
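
As a rough sketch of these characteristics, the example below uses Python’s built-in `sqlite3` module as a stand-in for a real warehouse platform. The tables (`fact_sales`, `fact_shipments`), their columns, and the sample rows are all hypothetical. Each table is modeled on a business process rather than a source system, carries denormalized attributes, and is stored at the lowest grain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per order line: the lowest grain of the (hypothetical) sales process.
    CREATE TABLE fact_sales (
        order_id INTEGER, order_date TEXT, customer_name TEXT,
        region TEXT, product_name TEXT, quantity INTEGER, amount REAL
    );
    -- One row per shipment: a second business process that shares order_id.
    CREATE TABLE fact_shipments (
        order_id INTEGER, ship_date TEXT, carrier TEXT
    );
    INSERT INTO fact_sales VALUES
        (1, '2021-05-01', 'Acme',   'West', 'Widget', 2, 40.0),
        (1, '2021-05-01', 'Acme',   'West', 'Gadget', 1, 60.0),
        (2, '2021-05-02', 'Globex', 'East', 'Widget', 5, 100.0);
    INSERT INTO fact_shipments VALUES
        (1, '2021-05-03', 'FastShip'),
        (2, '2021-05-06', 'SlowShip');
    """
)

# Span two business processes (sales and shipping) with one plain SQL query.
# Sales are summed per order first so the join cannot double count amounts.
summary = conn.execute(
    """
    SELECT s.region,
           SUM(s.order_amount) AS revenue,
           AVG(julianday(sh.ship_date) - julianday(s.order_date)) AS avg_days_to_ship
    FROM (
        SELECT order_id, region, order_date, SUM(amount) AS order_amount
        FROM fact_sales
        GROUP BY order_id, region, order_date
    ) AS s
    JOIN fact_shipments AS sh ON sh.order_id = s.order_id
    GROUP BY s.region
    ORDER BY s.region
    """
).fetchall()

# Drill down to the lowest level of detail to see what makes up one order.
detail = conn.execute(
    "SELECT product_name, quantity, amount FROM fact_sales "
    "WHERE order_id = 1 ORDER BY product_name"
).fetchall()
```

The point of the sketch is the shape of the queries: no knowledge of the underlying operational systems is needed, only basic SQL and an understanding of the business processes.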

What is a Data Lake?

Most see a data lake as the main differentiator of a modern data stack, but that really depends on what you mean by a data lake. When we use the term data lake, we are referring to a central repository that accepts relational (structured), semi-structured, and unstructured data in a low-to-no modeling framework. Think about it as a well-organized storage facility operating under the principle of incremental replication of all data assets in their rawest form. The main goal of your data lake should be to drive ease of consumption and centralization of data assets. A data lake:

Reduces risk for an organization when it supports data replication and storage in a way where people no longer need to directly query the operational business systems to get answers.
Presents new opportunities for an organization when it makes hard-to-access data available to the right audiences in a one-stop-shop.

It is important, though, to think of a data lake as a concept rather than a specific technology. Something is a data lake if it supports the conceptual requirements listed above. Do not think that simply buying a certain technology means you have created a data lake and adopted a modern data architecture. Only when you have built a centralized, replicated data repository that creates less governed pathways to raw data have you realized the concept of a data lake.
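
The idea of incremental replication in rawest form can be sketched in a few lines. This is only an illustration under stated assumptions: the source rows, the `updated_at` change-tracking column, the watermark approach, and the folder layout are all hypothetical, and a local temporary directory stands in for whatever storage actually backs the lake.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows from an operational system; "updated_at" is an assumed
# change-tracking column.
SOURCE_ROWS = [
    {"id": 1, "status": "new", "updated_at": "2021-05-01T10:00:00"},
    {"id": 2, "status": "paid", "updated_at": "2021-05-02T09:30:00"},
    {"id": 1, "status": "shipped", "updated_at": "2021-05-03T16:45:00"},
]


def replicate_increment(rows, lake_dir: Path, source: str, watermark: str) -> str:
    """Append rows changed after `watermark` to the lake, unmodified.

    Returns the new watermark (the latest `updated_at` value replicated).
    """
    # ISO-8601 timestamps in the same format compare correctly as strings.
    changed = [r for r in rows if r["updated_at"] > watermark]
    if not changed:
        return watermark
    target = lake_dir / source
    target.mkdir(parents=True, exist_ok=True)
    new_watermark = max(r["updated_at"] for r in changed)
    # One file per batch keeps history: earlier versions are never overwritten.
    batch_file = target / f"batch_{new_watermark.replace(':', '-')}.jsonl"
    with batch_file.open("w", encoding="utf-8") as fh:
        for row in changed:
            fh.write(json.dumps(row) + "\n")
    return new_watermark


lake = Path(tempfile.mkdtemp())
# First run replicates everything seen so far; the second picks up only changes.
wm = replicate_increment(SOURCE_ROWS[:2], lake, "orders", watermark="")
wm = replicate_increment(SOURCE_ROWS, lake, "orders", watermark=wm)
files = sorted((lake / "orders").glob("*.jsonl"))
```

Nothing in the sketch models or cleans the data. Records land exactly as the source produced them, which keeps consumers from querying the operational system directly while leaving governed modeling to the data warehouse.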

The Data Warehouse and Data Lake are More than Technologies

Thinking of the data lake as a concept instead of a technology is helpful because the lines are starting to blur between the technologies that support the concepts of a data warehouse and a data lake. Databricks markets Delta Lake, which smells like a data lake but can be used to build a conceptual data warehouse. Snowflake markets its cloud data platform, which smells like a data warehouse but supports the conceptual requirements of a data lake listed earlier.

What matters more than the technology is understanding and applying the critical concepts of a data warehouse and a data lake in your data estate—the tech is getting simpler, but the modern data problems are still growing in complexity.

The most important measure of your data stack and data architecture is whether it can address modern data-related problems: data variety, data volume, disparate sources, governance, and easy central access to data originating from the cloud and from your old on-prem data sources (imagine a world where no one has to figure out how to connect directly to your mainframe anymore). A modern data stack no longer requires that your data warehouse and your data lake exist as separate technologies. It does require that you establish the concept of a data lake for less governed pathways to data and the concept of a data warehouse for highly governed pathways to data.

What about streaming? Real-time data? ML? Don’t these require specific data lake technology? Your requirements as a business will certainly direct you to specific technology to support the challenges your organization faces with your modern data problems. My point is that not all organizations will require separate technologies to establish these core concepts and modernize.

The true vision of the modern data stack, once realized, is the ability to create highly governed pathways to mission-critical information while granting your innovative and fast-moving teams access to constantly updated, replicated data. This enables you to respond quickly to valuable business opportunities while still building a secure, stable, scalable foundation for the future.
