There is a lot of noise in the data and analytics industry today about the modern data stack and how to build a modern data architecture to support it. Data-driven organizations understand the value in making big data accessible to all business consumers but find that there is too much information—and hence confusion—around how to modernize and where to start.
Technology vendors have catchy slides showing how they hold the keys to modernization, suggesting that if you do not have their product, you are behind the times and behind your competition. But modernization isn’t just a technology play; it should include strategy around people and processes too. It’s about building a modern data architecture so that data flows through your organization and every business user who needs it has access to it when they need it.
Generally speaking, data architecture is the plan and design for the entire data lifecycle for an organization, all the way from when data is captured to when value is generated from data through analytics. When we hear the phrase modern data architecture in the marketplace discussion, it tends to be more specific and targeted to a particular solution and particular technologies as opposed to describing the entire discipline of data architecture.
Data architecture is what allows organizations to maximize the value that they create from data with analytics by executing on the vision of a comprehensive data strategy that is connected to business goals and focused on people, process, technology, and data. Without a sound data architecture, organizations are limited in the impact, scope, scale, and breadth of analysis that they’re capable of.
Let’s be honest: the term “modern” is vague. What is modern today is not guaranteed to be modern tomorrow, and anyone can tailor their definition of “modern” to suit their purposes. The only guarantee is that there will always be more modern software to buy and another new approach to try out. Before you fully buy in to modern data stack software marketing, it is important to acknowledge that every organization has different business and data problems, and therefore not every organization should adopt the exact same architecture or stack.
Common characteristics of modern data problems include:
In my experience, there are two key concepts in modern data architecture that help address the modern data problems above: the data warehouse and the data lake. Both terms are widely misunderstood. Understanding these concepts and how to apply them is one of the keys to making your data estate capable of overcoming your modern data problems.
The data warehouse is no longer cool. Some claim it is dead (though some people also say dashboards are dead, dimensional modeling is dead, and so on). There are many reasons for this, but I think the main one is that it seems like everyone has tried to build a data warehouse, and it did not go well. When I see the phrase “enterprise data warehouse” at a client site, my confidence that it is a business-accepted, performant, and useful component for data across the enterprise is basically zero. There is a reason a popular cloud technology moved away from branding itself as a cloud data warehouse and toward a cloud data platform. The term data warehouse comes with all sorts of baggage.
However, you should be willing to reset your perceptions of and feelings toward what you think a data warehouse is, because a fit-for-purpose data warehouse is mission-critical to any organization that is doing analytics broadly and at scale.
A data warehouse is a highly governed data estate where raw data is translated into reliable, consistent, and quality-rich information. A data warehouse:
If not all of the above is true of the thing you’re calling a data warehouse, it is time to revisit your data warehouse strategy.
Most see a data lake as the main differentiator of a modern data stack, but that really depends on what you mean by a data lake. When we use the term data lake, we are referring to a central repository that accepts relational, structured, semi-structured, and unstructured data in a low-to-no modeling framework. Think of it as a well-organized storage facility operating under the principle of incremental replication of all data assets in their rawest form. The main goal of your data lake should be to drive ease of consumption and centralization of data assets. A data lake:
It is important, though, to think of a data lake as a concept rather than a specific technology. Something is a data lake if it supports the conceptual requirements listed above. Simply buying a certain technology does not mean you have created a data lake or adopted a modern data architecture. Only when you have built a centralized repository of replicated data that creates less governed pathways to raw data have you realized the concept of a data lake.
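To make the idea of incremental replication in rawest form a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `RawZone` class, the `replicate` method, and the `updated_at` watermark column are hypothetical names, not features of any particular product. It simply copies rows that are newer than the last load and stores them untouched, with no modeling applied.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RawZone:
    """Hypothetical stand-in for lake storage: append-only, no modeling."""
    records: list = field(default_factory=list)      # raw payloads kept as JSON strings
    watermarks: dict = field(default_factory=dict)   # last replicated position per source

    def replicate(self, source_name, rows, position_key="updated_at"):
        """Copy only rows newer than the stored watermark, without reshaping them."""
        last_seen = self.watermarks.get(source_name)
        new_rows = [r for r in rows if last_seen is None or r[position_key] > last_seen]
        for row in new_rows:
            # Store the payload as-is: no type casting, no column mapping.
            self.records.append({"source": source_name, "payload": json.dumps(row)})
        if new_rows:
            self.watermarks[source_name] = max(r[position_key] for r in new_rows)
        return len(new_rows)


lake = RawZone()

# First load: every row is new. ISO date strings compare correctly as text.
orders_v1 = [
    {"id": 1, "updated_at": "2024-01-01", "amount": "10.50"},
    {"id": 2, "updated_at": "2024-01-02", "amount": None},
]
first_load = lake.replicate("orders", orders_v1)

# Second load: only the row past the watermark is copied, even though
# it carries a field the earlier rows did not have.
orders_v2 = orders_v1 + [{"id": 3, "updated_at": "2024-01-03", "note": "new field"}]
second_load = lake.replicate("orders", orders_v2)
```

Notice that nothing here depends on a specific storage technology. The concept is the append-only, unmodeled copy plus a record of how far replication has progressed.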
Thinking of the data lake as a concept instead of a technology is helpful because the lines are starting to blur between the technologies that support the concepts of a data warehouse and a data lake. Databricks markets Delta Lake, which smells like a data lake but can be used to build a conceptual data warehouse. Snowflake markets its cloud data platform, which smells like a data warehouse but supports the conceptual requirements of a data lake listed earlier.
What matters more than the technology is understanding and applying the critical concepts of a data warehouse and a data lake in your data estate—the tech is getting simpler, but the modern data problems are still growing in complexity.
The most important measurement of your data stack and data architecture is that you can address modern data-related problems—data variety, data volume, disparate sources, governance, and easy central access to data originating from the cloud and from your old on-prem data sources (imagine a world where no one has to figure out how to connect directly to your mainframe anymore). It is no longer a requirement of a modern data stack that your data warehouse and your data lake exist as separate technologies. It is a requirement that you establish the concept of a data lake for less governed pathways to data, and the concept of a data warehouse to establish highly governed pathways to data.
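As a rough illustration of both concepts living in one technology, the sketch below uses Python's built-in `sqlite3` module as a stand-in for whatever platform you actually run. The table name `raw_orders`, the view name `dw_orders`, and the cleaning rules are hypothetical, chosen only to show a less governed pathway and a highly governed pathway side by side.

```python
import sqlite3

# One engine hosts both concepts; SQLite is only a stand-in here.
conn = sqlite3.connect(":memory:")

# Less governed pathway: replicated rows kept as they arrived,
# including a null amount, inconsistent casing, and a duplicate.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, "10.50", "emea"),
        (2, None, "EMEA"),
        (3, "7.25", "apac"),
        (3, "7.25", "apac"),
    ],
)

# Highly governed pathway: typed, standardized, deduplicated,
# and filtered by an (assumed) rule that amounts must be present.
conn.execute(
    """
    CREATE VIEW dw_orders AS
    SELECT DISTINCT id,
           CAST(amount AS REAL) AS amount,
           UPPER(region) AS region
    FROM raw_orders
    WHERE amount IS NOT NULL
    """
)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
governed = conn.execute(
    "SELECT id, amount, region FROM dw_orders ORDER BY id"
).fetchall()
```

Fast-moving teams can query `raw_orders` directly, while reporting reads only from `dw_orders`. In a real data estate the governed layer would carry far more than a view (ownership, tests, documentation, access control), but the separation of pathways is the point.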
What about streaming? Real-time data? ML? Doesn’t this require specific data lake technology? Your requirements as a business will certainly direct you to specific technology to support the challenges your organization faces with your modern data problems. My point is that not all organizations will require separate technologies to establish these core concepts to embrace modernity.
The true vision of the modern data stack, fully realized, is the ability to create highly governed pathways to mission-critical information while granting your innovative and fast-moving teams access to constantly updated, replicated data so they can address business problems quickly. This enables you to respond to valuable business opportunities as they arise, while still building a secure, stable, scalable foundation for the future.