Businesses have applications that process and store thousands, even millions, of transactions each day. The ability to create, retrieve, update, and delete this data is made possible by databases, also referred to as online transaction processing (OLTP) systems. While these databases have traditionally been relational (SQL Server, Oracle, MySQL, DB2, etc.), in recent times non-relational databases (Cassandra, MongoDB, Redis, etc.) and file systems (like Hadoop) have been adopted as alternatives for storing raw data.
A data warehouse, also commonly known as an online analytical processing (OLAP) system, is a repository of data that is extracted, loaded, and transformed (ELT) from one or more operational source systems and modeled to enable data analysis and reporting in your business intelligence (BI) tools.
A data warehouse will help you make better, more informed decisions for many reasons.
There are many types of data warehouses, but the three most common are the enterprise data warehouse (EDW), the operational data store (ODS), and the data mart.
These types of data warehouses are logical entities that may exist within a single physical database.
The data stored in data warehouses and data marts (OLAP) is denormalized, which allows easy aggregation, summarization, and data drill-down. Data warehouses also enable business users to gain insight into what happened, why it happened, what will happen, and what to do about it.
OLTP systems, on the other hand, are individual applications built to quickly record specific business processes, like credit card transactions. And unlike the denormalized nature of data warehouses, the data structures of OLTP systems are highly normalized to ensure atomicity, consistency, isolation, and durability (ACID). Because writing analytical queries against such applications is complex, developers or subject matter experts are most often required for support.
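To make that concrete, here is a minimal sketch of an OLTP-style write, using Python's built-in sqlite3 module as a stand-in for a transactional database; the table and column names are hypothetical. The whole write commits or rolls back as a single ACID transaction.

```python
import sqlite3

# sqlite3 stands in for an OLTP database; table/columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE card_transactions (
    txn_id INTEGER PRIMARY KEY,
    card_number TEXT NOT NULL,
    amount_usd REAL NOT NULL,
    txn_time TEXT NOT NULL
)""")

# "with conn" opens a transaction: it commits on success and rolls
# back automatically if any statement inside raises an error.
with conn:
    conn.execute(
        "INSERT INTO card_transactions (card_number, amount_usd, txn_time) "
        "VALUES (?, ?, datetime('now'))",
        ("4111-XXXX", 42.50),
    )
conn.close()
```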
Relational database technologies are generally optimized for OLTP or OLAP use cases. For example, Microsoft SQL Server and Postgres are row-store databases that are optimized for OLTP. Cloud-based MPP (massively parallel processing) databases such as Snowflake and Google BigQuery are column-stores that are optimized for OLAP. It should be noted that while OLTP databases can be used for data warehousing use cases at smaller data volumes, OLAP databases are typically not suitable for transactional use cases.
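For the OLAP side, the sketch below shows the kind of aggregate query a column store handles well. It uses DuckDB, an embeddable columnar engine chosen here purely as an illustration (it is not one of the products discussed above), against a hypothetical sales table.

```python
import duckdb  # pip install duckdb; embeddable columnar (OLAP) engine

con = duckdb.connect()
# Hypothetical fact data. A column store answers the query below by
# scanning only the columns it touches (region, amount_usd).
con.execute("""
    CREATE TABLE sales AS
    SELECT * FROM (VALUES
        ('East', 120.0),
        ('West', 80.0),
        ('East', 200.0)
    ) AS t(region, amount_usd)
""")
print(con.execute(
    "SELECT region, SUM(amount_usd) AS total FROM sales GROUP BY region"
).fetchall())  # e.g. [('East', 320.0), ('West', 80.0)]
```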
Because data volumes are growing exponentially, a data warehouse becomes critical, and you should consider the infrastructure that stores, processes, and moves your data. Data warehouses can be stored on-premise, in the cloud, or in a mixture of the two environments. Your decision may depend on requirements to keep mission-critical applications on-premise. If you are looking into cloud solutions, take into consideration industry regulations, security, visibility, accessibility, latency, and the trustworthiness of the cloud providers. (Some top cloud providers include Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.)
At a high level, an end-to-end data warehouse architecture includes gather, clean, store, and share layers.
The gather layer consists of various data silos—these can include ERP systems, CRM, Excel spreadsheets, even Access databases housing corporate or divisional data. The storage subsystems used by these applications are typically not structured for easy querying or navigation (if direct access is available at all). Native reporting may be possible in some of these applications; however, functionality is typically very sparse, and reporting is limited to only the data within the single system.
The clean layer applies business logic and other calculations to data that will ultimately be made available in the store layer (the data warehouse). Business logic can include custom KPIs, business-specific calculations or rulesets, data hierarchies, or new derived columns that are not otherwise available from any source system. Data typically exists in the clean layer only temporarily; the layer exists to create these custom values and pass them through to the data warehouse, and end-user reporting or querying against the clean layer is not allowed.
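As a rough sketch of clean-layer logic, the snippet below derives KPI columns with pandas before the rows move on to the warehouse; the source columns and KPI definitions are invented for illustration.

```python
import pandas as pd

# Hypothetical raw extract from a source system.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "revenue": [100.0, 250.0, 80.0],
    "cost": [60.0, 150.0, 90.0],
})

# Clean-layer logic: derive business-defined columns that exist in no
# source system, then pass the result through to the store layer.
cleaned = orders.assign(
    gross_margin=lambda df: (df["revenue"] - df["cost"]) / df["revenue"],
    is_profitable=lambda df: df["cost"] < df["revenue"],
)
print(cleaned)
```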
The store layer represents the denormalized data warehouse that is described further throughout this blog post. While there are several design models, the Kimball approach is a leading design through which information is organized into dimension and fact tables and joined in star schemas for ease of use. The store layer may contain data marts on top of the Kimball star schemas that are optimized for specific downstream use cases.
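Here is a minimal, hypothetical example of that star-schema shape, run through sqlite3 so it is self-contained: a fact table joined to two dimension tables on surrogate keys, then aggregated.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical Kimball-style star schema: one fact table plus two
# dimension tables, joined on surrogate keys.
con.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount_usd REAL);
INSERT INTO dim_date VALUES (20240101, 2024);
INSERT INTO dim_product VALUES (1, 'Hardware');
INSERT INTO fact_sales VALUES (20240101, 1, 99.0);
""")
print(con.execute("""
    SELECT d.year, p.category, SUM(f.amount_usd)
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
""").fetchall())  # [(2024, 'Hardware', 99.0)]
```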
The share layer isn’t a formal layer as much as a representation of all the various ways data—which resides in the data warehouse—can be consumed. These uses include querying via a BI tool, direct SQL querying, or even automated extracts to feed other, unrelated systems (reverse ETL).
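A tiny sketch of that last path (the automated extract, or reverse ETL): query the warehouse and hand the result to a downstream system as a flat file. The mart table and file name are hypothetical.

```python
import csv
import sqlite3

# Hypothetical warehouse content, built in memory so the sketch runs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mart_customers (customer_id TEXT, lifetime_value REAL)")
con.execute("INSERT INTO mart_customers VALUES ('C-1', 1250.0)")

# Reverse ETL: export curated warehouse data for an external system.
rows = con.execute(
    "SELECT customer_id, lifetime_value FROM mart_customers"
).fetchall()
with open("crm_feed.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "lifetime_value"])
    writer.writerows(rows)
```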
Take a closer look at how information is stored and shared across your enterprise. Inconsistencies in data and reporting, difficulty sharing data, and multiple data sources are all signs that a data warehouse might be the business solution for you. Here are three basic questions to ask and answer if you are still deciding:
1.) Do you store data in various source systems?
Gathering data that is structurally different across operational databases, flat files, and legacy systems is challenging for many organizations. How can you integrate data from disparate systems with different structures? Often it is done by spending days or weeks pulling data together to create reports. Operational systems are also not modeled for analytical queries and most often contain data specific to a single business area; this lack of integration across business areas means you may not get a full picture of the health of the organization. In addition, a shortage of people with the skills and permissions to query each application can be an impediment when dealing with many source systems.
A data warehouse makes this process efficient and automatic by extracting, loading, and transforming (ELT) data from the various source systems in a standardized and consistent manner. Accessing the data from a central location not only allows users to make quicker, better business decisions, but also saves time that would otherwise be wasted retrieving and blending data from multiple information systems.
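Here is a hedged, minimal picture of that ELT flow: rows extracted from two hypothetical source systems are loaded raw into a staging table, and the transformation happens inside the warehouse itself.

```python
import sqlite3

# sqlite3 stands in for the warehouse; sources/columns are hypothetical.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE stg_customers (source TEXT, customer_id TEXT, email TEXT)"
)

# Extract + Load: land the raw rows exactly as the sources provide them.
crm_rows = [("crm", "C-1", "a@example.com")]
erp_rows = [("erp", "1001", "A@EXAMPLE.COM")]
warehouse.executemany(
    "INSERT INTO stg_customers VALUES (?, ?, ?)", crm_rows + erp_rows
)

# Transform: standardize inside the warehouse (the "T" in ELT).
warehouse.execute("""
    CREATE VIEW customers AS
    SELECT source, customer_id, LOWER(email) AS email FROM stg_customers
""")
print(warehouse.execute("SELECT * FROM customers").fetchall())
```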
2.) Are you experiencing performance issues by reporting against operational systems?
Operational systems contain volatile data that changes often. Consequently, running reports directly against such systems with near real-time data can cause performance problems, and the insights gathered may be inconsistent. A data warehouse solves this problem because it is optimized for read access, resulting in faster report generation.
3.) Do you have a single source of the truth?
Reporting on data that is stored and formatted differently across siloed enterprise systems results in inconsistency across departments. A well-built data warehouse improves data quality by cleaning data as it is imported, yielding more accurate data. One version of the truth can then serve every department across the enterprise, with the assurance that each department is using the same data.
Sometimes data warehouses cannot solve every business problem because of their inherent dependence on relational data structures. The adoption of new data sources, such as social media, IoT, logs, video, and audio, has brought rapid change in both content and volume. A downside has been the lack of internal checkpoints for data ownership, which makes it difficult to apply the data governance principles familiar from traditional data warehouse projects.
To address the challenges brought about by these new ways of storing data, organizations have adopted emerging technologies and approaches such as data lakes, data lakehouses, self-service BI and data visualization tools, non-relational (NoSQL) databases, and polyglot persistence. Let's have a look at what they offer:
A data lake is a collection of unstructured, semi-structured, and structured data copied from one or more source systems (independent of technology). The stored data is an exact replica of the source. The goal is to make raw data available to highly skilled analysts within an enterprise for future needs that are not known at the time of data capture.
The key difference from a data warehouse or data mart is that the data is not modeled to a predetermined schema of facts and dimensions. It is this lack of imposed structure that empowers developers, data analysts, and data scientists to create exploratory models, queries, and applications that can be refined endlessly on the fly.
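As a sketch of that schema-on-read idea, the snippet below lands raw JSON events in a folder (standing in for object storage such as S3 or ADLS) and applies structure only when an analyst reads them; the event fields are hypothetical.

```python
import json
from pathlib import Path

# A local folder stands in for object storage. Raw events are landed
# exactly as received, with no schema imposed at write time.
lake = Path("lake/raw/events")
lake.mkdir(parents=True, exist_ok=True)
event = {"user": "u1", "action": "click", "ts": "2024-01-01T00:00:00Z"}
(lake / "event-0001.json").write_text(json.dumps(event))

# Schema-on-read: the analyst decides later which fields matter.
records = [json.loads(p.read_text()) for p in lake.glob("*.json")]
clicks = [r for r in records if r.get("action") == "click"]
print(len(clicks))
```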
A data lakehouse combines elements of both a data lake and a data warehouse within a single platform. Like a data lake, a lakehouse stores structured, semi-structured, and unstructured data as objects in low-cost storage. Through a metadata layer, data lakehouses offer features typically found in relational database management systems (RDBMS), such as ACID transactions, schema enforcement, indexing and caching, time travel, and access control. Data lakehouses can be accessed through traditional BI tools and APIs.
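One concrete, purely illustrative way to see those features is the open-source deltalake Python package, which implements the Delta Lake table format (Iceberg and Hudi are comparable formats); the path and data here are hypothetical.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake  # pip install deltalake

# Each write is an ACID-committed version of the table.
write_deltalake("lakehouse/sales", pd.DataFrame({"amount": [10.0]}))
write_deltalake(
    "lakehouse/sales", pd.DataFrame({"amount": [25.0]}), mode="append"
)

# Time travel: read the table as of any earlier committed version.
print(DeltaTable("lakehouse/sales").to_pandas())             # latest state
print(DeltaTable("lakehouse/sales", version=0).to_pandas())  # first write
```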
Databricks is a leader in the data lakehouse space, and there are up-and-coming contenders like Dremio. Snowflake also offers features that are functionally similar to a data lakehouse.
Self-service BI is an approach that gives business users the freedom and responsibility to create reports without relying on IT. Sometimes data warehouses lack the agility to scale to the needs of quickly evolving companies. Self-service solutions let companies stay nimble by giving departments access to data and information on demand, and they can typically be used at all skill levels.
Some of the most popular self-service BI tools on the market are Qlik, Tableau, Power BI, and Sisense. Looker is also a great platform for self-service BI; however, it requires a data warehouse to already be in place and cannot blend data from multiple disparate sources the way the other BI tools can.
NoSQL is an architectural approach to databases that does not rely on the traditional relational representation of data. Relational database management systems organize data into tables of rows and columns, defined by schemas, for CRUD (create, read, update, and delete) operations. NoSQL databases instead use more flexible data models that offer speed, scalability, and flexibility, key characteristics for dealing with "Big Data".
There are various types of NoSQL databases on the market today, and they fall into four main categories: key-value stores, document databases, wide-column stores, and graph databases.
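As a small taste of the key-value category, here is a sketch using the redis-py client against Redis (mentioned earlier as a non-relational database); it assumes a Redis server running on localhost, and the key and payload are hypothetical.

```python
import redis  # pip install redis; assumes a Redis server on localhost

# Key-value access: no tables, no joins, just fast lookups by key.
r = redis.Redis(host="localhost", port=6379)
r.set("session:42", '{"user": "u1", "cart_items": 3}')
print(r.get("session:42"))  # b'{"user": "u1", "cart_items": 3}'
```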
If you have settled on implementing a data warehouse, here are some things to consider for a successful project:
1.) Start with skills. One of the key reasons data warehouse projects go awry is a talent deficit. A data warehouse project requires experienced project members, so be sure to assess the skills of your team. Are they skilled in data integration and modeling? Are they trained on new technologies and approaches? Have they worked on similar projects, in both domain and scale? Do you have team leads capable of mentoring and guiding less experienced staff?
2.) Identify business requirements with corporate and departmental objectives in mind. At this stage, it is not about the data; it's about identifying business needs to operate more efficiently and make data-driven decisions. Another reason data warehouse projects fail is that requirements do not address the business objectives; instead, they are created to demonstrate the progress and complexity of the project.
3.) Assess data requirements. With analytics requirements in hand, identify the sources of data needed to achieve each requirement. Assess the quality of the available data sources and identify any data remediation each source may require. Compile a data warehouse Bus Matrix (a small sketch follows this list) and a conceptual data model; both will become core elements of your data warehouse requirements.
4.) Assess the Bus Matrix and create a roadmap. Use the Bus Matrix to help prioritize data sources. Take your highest-priority analytic requirements and identify all required sources. Create an incremental roadmap that delivers the highest-value analytics first. Each increment in the roadmap should be manageable in scope. If the scope is too big right off the starting line, reprioritize so that you can implement low-effort, high-value items first.
5.) Address the architecture. Identify a technology stack that will meet your long-term business needs. A successful data warehouse should have a lifespan of potentially many years. Plan to build out the skillset necessary to run and operate the data warehouse or select a technology stack you’re familiar with.
6.) Manage to completion. Each phase of the roadmap should be delivered to completion as if it were the last step in the roadmap. Failing to do so will affect later phases and set a precedent that "done" doesn't mean "complete".
7.) Measure success and communicate it. Each phase of the data warehouse project should create value. Define, measure, and communicate that value. A project that delivers incremental value will create momentum and increase executive sponsorship.
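As promised in step 3, here is a small, hypothetical sketch of a Bus Matrix: business processes map to the conformed dimensions they share, and dimensions shared by many processes are good candidates to build first.

```python
# Hypothetical Kimball Bus Matrix: business processes (rows) mapped to
# the conformed dimensions (columns) each one needs.
bus_matrix = {
    "retail_sales": {"date", "product", "store", "customer"},
    "inventory":    {"date", "product", "store"},
    "web_orders":   {"date", "product", "customer"},
}

# Dimensions reused across many processes are high-priority builds.
all_dims = set().union(*bus_matrix.values())
for dim in sorted(all_dims):
    used_by = [p for p, dims in bus_matrix.items() if dim in dims]
    print(f"{dim}: shared by {len(used_by)} business process(es)")
```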