Implementing an enterprise data warehouse (EDW) can be a great way to support your digital transformation journey. According to a Gartner survey, 72 percent of data and analytics leaders at enterprises are leading or involved in digital transformation initiatives. However, hiring data leaders isn’t the only way to digitally transform and become a data-powered organization. The Data and AI Leadership Executive Survey 2022 by NewVantage Partners revealed the following discrepancy:
73.7 percent of surveyed executives have appointed Chief Data Officer (CDO) or Chief Analytics Officer (CAO) at their organizations
Only 26.5 percent of executives report that their organizations have become data-driven
This means that 73.7 percent of organizations are convinced they need to enhance data management capabilities with the help of a strong leader. But appointing a CDO or CAO still doesn’t guarantee that a company will become data-driven. Implementing data management and data warehousing technology solutions as well as emphasizing the importance of making data-driven decisions for all employees can help you transform into a data-driven company.
One of the most important benefits of an enterprise data warehouse (EDW) is that it fosters data-driven decisions, helping your managers draw on reliable information when making decisions.
In this post, we’ll first talk about EDWs in general, then cover a detailed step-by-step process to develop an efficient EDW as well as discuss in detail the major components of its development. Let’s begin with a general overview of enterprise data warehouses.
Read also: How we aggregated customer financial data for a wealth management platform
Enterprise data warehouse development: definition, problems to solve, and real-life scenarios
Enterprise data warehouse design was introduced by computer scientist B. Inmon and reveals his attitude to business data management. Let’s take Inmon’s definition of an EDW from his book “Building the data warehouse”: An EDW is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process. It’s the data footprint of your business processes. An EDW design is defined by the way you look at your operations from a data perspective.
With the help of an EDW, it’s possible to set your own terms for data aggregation, data granularity, and data partitioning. You can define low (more detail) or high (less detail) granularity criteria for different datasets. And for optimized data storage, you can also partition datasets into smaller ones.
Issues to solve with an EDW:
Chaotic data management. Enterprise data warehouse services allow organizations to implement a structured approach to data storage and, as a result, data analysis. In simple terms, with a clear request, you can quickly find any data you need in an EDW.
Cumbersome access to different datasets. With an EDW, you won’t need to maintain multiple data access policies. Once all of your enterprise data is consolidated in the EDW, you’ll only need to manage access rights to it.
Error-prone and time-consuming reporting and analytics. Collecting different data among scattered corporate databases complicates and significantly prolongs report generation. Thus, data analysts can potentially make mistakes during this process. With an EDW, analysts can quickly access necessary (and, most importantly, accurate) data to generate correct and efficient reports.
Data duplication. Since data in an EDW is normalized and transformed for future analysis, there’s minimal data duplication, and consequently, data analysts need less time to generate quality reports.
Read also: How we developed a data processing and reporting system
The first and biggest reason for building a data warehouse from the get-go is to make it perfectly fit your current business model. Especially if your company is large and you need to keep track of the data flow across multiple departments or even facilities like a hospital network. It’s worth mentioning that an EDW can be part of your bigger enterprise data management projects.
For example, a large Helsinki University Hospital (HUS) with more than 17 hospital locations has incrementally adopted enterprise data warehouse solutions as part of a much bigger project called a “big data platform.” The company needed a single source of truth for structured data like financial data and unstructured data like medical images. You can also learn how we’ve helped the US 3PL company generate more insights and better manage a big data flow thanks to our big data analytics solution.
At the initial phase of this project, they developed an EDW to collect and store structured data. Thus, an EDW turned out as an essential building block of a much bigger enterprise-wide data platform. And that’s one of the benefits of an enterprise data warehouse, as it can integrate with other data repository solutions in your organization, such as a data lake, to form a unique data platform that perfectly fits your business needs. Since the initial implementation of its EDW, HUS has been gradually expanding this data platform to better fit its changing business environment.
Yalantis has expertise in implementing a data lake. Read our case study on how we implemented a data lake for a manufacturing company.
Read also: Centralized repository for your data: data mart, data warehouse, or data lake?
Now we invite you to compare life with an EDW to life without:
If the above infographic has inspired you, continue reading to learn about building a data warehouse from scratch. A digital warehouse can become an end-to-end business solution by means of chaining all of your corporate data sources.
Read also: Our big data and analytics expertise
Building a data warehouse step by step
Implementing a data warehouse is, luckily, a gradual process like digital transformation and doesn’t require you to make huge investments at once. Below is a list of steps you should take along with your IT engineers as part of the full-cycle custom development process to successfully implement an EDW into your daily workflow. Further, we’ll discuss some of the most crucial elements from this list in greater detail.
Step 1. Define business requirements
Drawing on your business needs, compose a detailed list with prioritized functional and non-functional business requirements, as they will impact your choice of data warehousing solutions. For instance, if your business environment is constantly changing, then one of your non-functional requirements should be flexibility. If in the near future you want your company to change and grow, your priority should be scalability. Consider working with a skilled solution architect to properly define and handle your business requirements. Read our blog article to find out how an architect can do this.
Step 2. Analyze source data
To make sure you’ll load into your data warehouse only relevant and accurate data, you should define all available data sources. It’s also important to determine systems of record to avoid loading unnecessary datasets in a data warehouse, as certain data may be part of a few storage systems. For example, sales order information can be streamed from your order management system (OMS) to logistics software for your operational needs. Still, the OMS remains your source of truth, as logistics software may manipulate the data and distort potential insights.
Step 3. Build conceptual, logical, and physical data models
After you’ve defined your business requirements, it’s time to build a preliminary enterprise data warehouse model, as it will help you schematically visualize your key business processes, meaning your business entities and the way they interact with each other. It’s crucial to build those models with the support of domain experts, as each industry has specific business processes.
A conceptual data model is necessary to establish clear relationships between the main business entities (concepts) to define an enterprise’s information needs. For example, a supply chain company may distinguish business entities including shippers, carriers, suppliers, products, orders, customers, and manufacturers.
A logical data model is a detailed version of a conceptual model. The logical model provides attributes (columns) related to a certain business entity, such as a carrier’s country for a carrier entity.
A physical data model is even more detailed, as it adds more attributes to entities like primary and foreign keys. A primary key is a unique identifier, and there can be only one in a table; a foreign key is created to link two tables (entities). Practically speaking, a foreign key is a primary key from one table inserted into another.
As the data model reflects your business operations and modern businesses are in the process of constant change, you need to ensure that your data model can be easily adjusted as your business evolves. Business changes may cause significant updates to the data landscape, so involving a qualified information architect is necessary to make sure the data model is flexible enough.
Step 4. Identify and build a data warehouse schema
This step is a logical continuation of the previous. Once you have clear models of your business processes, you can proceed with arranging your final and detailed version of data modeling — the physical model — into a data warehousing schema. There are many types of schemas and methods to generate them automatically. Your software architect can help you choose the best approach. To give you an overall understanding, we’ll review a few of them in the next section — in particular, the star schema, snowflake schema, and data vault schema.
Step 5. Incrementally implement a data warehouse architecture
With a fitting data warehouse schema, you can compose an enterprise data warehouse architecture. There are several types of enterprise data warehouses. Go on reading to find out which you can consider. Taking into account the topic of our article, we’ll primarily discuss the enterprise data warehouse architecture.
Read also: Our expertise in developing custom data science solutions
Enterprise data warehouse schemas and architecture designs
In this section, we’ll drill deeper into the topic of data warehouse schemas and architecture types.
Data warehouse schemas
In this section, we’ll cover three data warehouse schemas: the star schema, snowflake schema, and data vault schema. A schema establishes relationships between datasets in a data warehouse to make them ready for querying and further analysis.
However, certain companies choose not to build any schemas but rather compose a set of flat tables for loading multiple data columns into one table. This approach doesn’t take into account data granularity and the difficulty of making changes to tables. Therefore, schema designs have appeared, as they provide more flexibility and a better data structure.
This is the most common and traditional way to build a data warehouse. A star schema consists of fact and dimension tables. To better understand how it’s built and its purpose, let’s take as an example a typical star schema for a healthcare company.
The table in the middle is a fact table that contains keys to define each dimension table. Each dimension table, in turn, contains dimension attributes. For example, a dimension table for a patient may consist of such attributes as patient name, address, age, sex, and insurance ID.
Pros of a star schema:
Easy to design and set up due to the simple relationship between datasets
Data analysts can quickly query data thanks to simplified join logic between the fact table and dimension tables
Cons of a star schema:
Less flexibility compared to other schemas and not suitable for complex analytical queries
High rate of data redundancy and duplication
Another type of data warehousing schema is the snowflake schema. In practice, the snowflake schema is an enhanced and normalized variant of the star schema. The snowflake schema also contains fact tables, but its dimension tables get split into multiple additional dimension tables. Let’s take the same example of a healthcare star schema and transform it into a snowflake schema.
We’ve made a very simplified snowflake schema, but it still shows its essence. We’ve only added one additional dimension table to each of the initial tables, but there can be more than one. That’s why a snowflake schema can become rather complex.
Pros of a snowflake schema:
Eliminates data redundancy and duplication, as each dimension attribute gets its own table and won’t get mixed up with similar attributes
Provides more data analytics possibilities than a star schema due to more elaborate relationships between datasets
Cons of a snowflake schema:
Harder to design and maintain snowflake schemas compared to star schemas
Complex arrangement of datasets within a snowflake schema complicates data queries, making them take longer than with a star schema
A data vault is a hybrid approach to data warehouse design that combines the best of the star schema and the third normal form (3NF) (another type of normalized data schema). Data vault inventor Dan Lindstedt says that a data vault is perfectly suitable for modern enterprise data warehouse solutions. Currently, the most recent version is data vault 2.0.
A data vault 2.0 contains raw and business vaults; a raw vault keeps all the raw data within the enterprise along with data duplicates, while the business vault is a layer on top of the raw vault that consists of business rules and calculations. The raw vault consists of three components:
Hubs are entities that include business keys (one key is used to identify one record). If there is no direct business key available, hubs also have surrogate keys to identify their data objects.
Links are connectors between different hubs and build relationships between business keys.
Satellites are descriptive tables associated with hubs. Each hub or link may have one or more child satellites. Satellites also contain metadata linking them with their parent hub or link.
Hubs and links are more structural elements of a data vault model, while satellites provide context for the business processes captured in the hubs and links. The data vault is a more complex and scalable model than the star and snowflake schemas. Below is a simplified example of customer and order data distribution in the data vault model.
Pros of a data vault:
Flexible and scalable in terms of adding or removing data sources (can scale to hundreds of terabytes or petabytes)
Quick and easy to automate an ETL process with a data vault and simultaneously download data to hubs, links, and satellites
Data vault modeling technique allows for spotting business issues that weren’t obvious before thanks to multiple data joins
Cons of a data vault:
Not suitable for static data with little to no changes
Not optimized for smooth query performance and often requires building data marts on top as a presentation layer
To make the right choice of data warehousing tools, you should consider the technology stack and the tools you’ll use to analyze your business data. For example, the Tableau business intelligence tool works with simple flat tables, as it’s hard for this system to process many data joins as in star and snowflake schemas. Power BI can easily process star schemas.
Data warehouse architecture types
The data warehouse development process is incremental and ongoing. According to Inmon, it can look the following way:
This process doesn’t necessarily happen in a matter of days, and the above schema is rather simplified and sped up. At first, you implement one subject area for your data warehouse and have a small number of analysts who can use this data for reporting purposes.
Then, you add more subject areas and more people get access to data. Afterward, you start building a few data marts on top of your EDW to organize data use and achieve more purposeful data queries. After that, you can scale even further with more data marts and data consumers. All in all, you scale your EDW as long as your business scales, your data volume grows, and your information needs across the enterprise increase.
The above schema by Inmon can be split into three data warehouse development strategies:
This architecture roughly corresponds to Inmon’s day one of a data warehouse implementation. It’s hardly possible, however, to unleash full enterprise data warehousing potential with a basic architecture only. At this stage, you simply configure the data flow from your corporate data sources to a new entity — enterprise data warehouse software.
Architecture with a staging area
This architecture already suggests setting up ETL (extract, transform, load) logic with the help of inserting a staging area. At this point in your EDW implementation, you start loading business data from your systems of record into the data warehouse through a staging area that cleans and transforms data to fit it into the necessary place in the data warehouse and in the necessary format.
Architecture with a staging area and data marts
To simplify data search and allow different departments to simultaneously access only those datasets they need for their business purposes, you can also build data marts on top of your data warehouse.
The end result is a layered enterprise data warehouse architecture that fully integrates into your enterprise data management framework:
The ingestion layer is a staging area and is responsible for preparing data before loading it into the data warehouse.
The storage layer consists of the data warehouse itself and data marts, which store data and keep it in the right places and formats so it’s ready for analytical purposes.
The presentation layer allows for retrieving the necessary data using all kinds of business intelligence tools and web services to visualize, analyze, and view it as insights rather than mere tables and columns.
Now that we’ve covered typical data warehousing architectural patterns, let’s move to implementing cloud data warehousing.
Cloud services to implement an EDW infrastructure
Implementing a data warehouse in the cloud is a common approach, as it has a number of advantages:
Ready solutions. A cloud environment is already established and requires only scaling, configuration, and customization to align with your goals.
Elasticity. Depending on your business needs, you can either expand storage space or decrease it, and such manipulations won’t require big investments, unlike with an on-premises environment.
Horizontal integration with different services. Cloud services easily integrate with other solutions. For example, cloud storage solutions can integrate with cloud solutions for data analysis.
No need to maintain IT staff. For an on-premises implementation, you have to maintain a large IT team to be in charge of all hardware and software. To maintain a cloud data warehouse, you need significantly fewer people.
Lower capital expenses and minimal operating expenses. This benefit is the ultimate result of all the above-mentioned benefits. To implement a cloud-based data warehouse solution, you don’t have to purchase any expensive hardware or software. You spend as you go and keep your budget flexible.
All of the major cloud providers offer solutions for data management, positioning themselves as enterprise data warehouse vendors. We would like to point out Amazon Web Services (AWS) and, in particular, Amazon Redshift. Redshift is a service specifically designed for data warehousing. Its biggest advantage is data integration with a data lake, allowing you to ensure a seamless information flow across the whole enterprise in real time. Amazon Redshift provides vast data analytics capabilities and introduces a whole new notion — a data lakehouse. The data lakehouse combines data from your data lake and a data warehouse, allowing you to achieve new heights in enterprise data analytics.
Even though the benefits of cloud computing sound compelling, there’s still a question: Is it secure enough to store data in the cloud? The answer is a definite “yes.” Cloud services provide reliable data encryption mechanisms that encrypt data during transfer and storage. And the keys used to decrypt the data are stored outside the cloud environment. This way cloud services provide a high level of data protection even for the most sensitive data.
Vendors of cloud data warehousing services also meet most common compliance requirements. For instance, AWS and Microsoft Azure meet the requirements of HIPAA, PCI DSS, ISO 27001, and FedRAMP.
Of course, you should make your choice of implementation environment based on your business and industry needs and requirements. It might even be the case that a hybrid environment with both an on-premises and cloud implementation is the best option for you.
Read also: Secure application development from planning to production
Let’s switch from discussion and considerations to real actions. If developing and implementing an EDW have been on your mind and business agenda for a while, it’s time to get started. You can develop and implement an EDW steadily and gradually. And we can be by your side during the whole process. Remember those steps to develop a data warehouse at the beginning of the article? You can take all of them alongside our domain experts and technical specialists. This way, we’ll ensure a custom approach to meeting your current data management issues.
Ready for an EDW?
Let us help.