Today’s business leaders are well aware that enterprises able to leverage data and analytics as a basis for decision-making will outperform peers that don’t. In fact, recent research conducted by PwC reveals that highly data-driven organizations are three times more likely to achieve better business results — including enhanced productivity and greater profitability.
But wanting to use data to answer pressing, business-relevant questions and actually being able to do so consistently, affordably and quickly are two very different things. Data warehouses are among the data management and storage architectures that were developed to support data-driven decision-making at enterprise scale.
What is a Data Warehouse?
A data warehouse is an information collection and processing system designed to power business intelligence (BI) and decision support solutions. Data warehouses integrate data from multiple, disparate sources into a centralized repository where it can serve as a single source of truth for the organization. They may comprise multiple technologies and components, but all are designed to aid in transforming data into something that’s usable for analysis, reporting and strategic decision-making.
Data warehouses first came into being in the late 1980s in response to performance challenges that arose within the online transaction processing (OLTP) databases then in widespread use. These databases were excellent for handling high volumes of transaction-oriented tasks, providing fast query response times and staying accurate despite frequent updates. But they weren’t very good at the large batch-read operations necessary to support complex or long-running analytics.
The problem was that business analytics applications required entirely different things of the database systems that “fed” them. There was far less need for frequent writing to the database. Instead, writes happened once daily or even less often. But the databases used for analytic purposes had to be able to support large-scale aggregated read and processing functions. They needed to be capable of scanning millions of individual records quickly to surface the answers to business questions. And they needed to be well-suited for retaining large amounts of historical data.
Data warehouses were built to solve this problem. The data flows into the warehouse from multiple operational sources. There, it’s prepared, cleaned and stored so that it’s ready to fuel the analyses, reports and dashboards that modern BI engines generate. Data warehouses can contain multiple databases as well as data ingestion and transformation tools, but their primary purpose is always to support data analysis.
When Do You Need a Data Warehouse?
The key driver for most businesses building data warehouses is the need to support analytics platforms like Tableau or Looker. These platforms execute frequent and complex queries that can put a lot of stress on a database, and it’s risky to subject operational databases to those stresses. Instead, data warehouses are modeled specifically for consumption by BI engines, and they’re up to the task.
Modern cloud data warehouses are more flexible and much easier to manage than data warehouses hosted on legacy on-premises hardware. They can readily be integrated with cloud data lakes, which provide more general-purpose storage, holding large volumes of unstructured or semi-structured data before it’s prepared for analytic use in the data warehouse. Many organizations that start by building a data lake in the cloud find that they need a data warehouse to support more complex analytic capabilities.
Data Warehouse Best Practices
No. 1: Begin with solid master data management (MDM) practices.
MDM focuses on building a controlled process through which correct, consistent and validated master data is created and established as the system of record for the enterprise. The primary challenge in MDM is ensuring that reliable and accurate master data is feeding the data warehouse. You’ll need to ascertain that data quality is maintained across all of your data sources, that no records are being lost or deleted as they’re moved into the warehouse, and that you’re tracking data source anomalies. Done well, this removes much of the transformation effort involved in populating your warehouse.
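One part of that quality gate can be automated: verifying that no records were lost in transit and that required master-data fields are populated. A minimal sketch of such a check is below; the field names, record shape and `validate_load` helper are illustrative assumptions, not part of any particular MDM product.

```python
from dataclasses import dataclass, field

@dataclass
class LoadReport:
    source_count: int
    loaded_count: int
    anomalies: list = field(default_factory=list)

def validate_load(source_records, loaded_records,
                  required_fields=("customer_id", "name")):
    """Check that no records were lost between source and warehouse,
    and that master-data fields required downstream are present."""
    anomalies = []
    if len(loaded_records) != len(source_records):
        anomalies.append(
            f"record count mismatch: {len(source_records)} extracted, "
            f"{len(loaded_records)} loaded"
        )
    for rec in loaded_records:
        missing = [f for f in required_fields if not rec.get(f)]
        if missing:
            anomalies.append(
                f"record {rec.get('customer_id', '?')} missing {missing}"
            )
    return LoadReport(len(source_records), len(loaded_records), anomalies)
```

A report with an empty `anomalies` list means the load can proceed; anything else should be surfaced to whoever owns the source system.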
No. 2: Invest time and effort into data standardization.
Imagine that your enterprise makes use of five different claims processing systems across different business units. You’ll need to aggregate and normalize the data from all of them in order to integrate them into your data warehouse. It requires engineering effort to harmonize disparate systems so that you can report on them in a consistent way, but it’s a worthwhile endeavor. Creating a common data format will enable you to eliminate inconsistencies in data formats, schemas and structures so that your data can be analyzed reliably and consistently.
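To make the claims example concrete, here is a small sketch of what harmonizing two of those systems into a common format might look like. The source systems, field names and date formats are hypothetical; the point is the pattern of mapping each source's schema onto one target schema.

```python
from datetime import datetime

# Hypothetical per-source field mappings: each claims system names and
# formats its fields differently. These mappings are illustrative only.
FIELD_MAPS = {
    "system_a": {"claim_no": "claim_id", "amt": "amount_usd", "dt": "claim_date"},
    "system_b": {"ClaimID": "claim_id", "TotalAmount": "amount_usd", "FiledOn": "claim_date"},
}

DATE_FORMATS = {"system_a": "%m/%d/%Y", "system_b": "%Y-%m-%d"}

def normalize_claim(record: dict, source: str) -> dict:
    """Map a source-specific claim record onto the common warehouse schema."""
    mapping = FIELD_MAPS[source]
    out = {target: record[src] for src, target in mapping.items()}
    # Standardize every source's date representation to ISO 8601.
    parsed = datetime.strptime(out["claim_date"], DATE_FORMATS[source])
    out["claim_date"] = parsed.date().isoformat()
    out["amount_usd"] = round(float(out["amount_usd"]), 2)
    out["source_system"] = source  # retain lineage for auditing
    return out
```

Once every source funnels through a mapping like this, reports can be written once against the common schema instead of once per claims system.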
No. 3: Build stable extract, transform and load (ETL) or extract, load and transform (ELT) pipelines.
ETL/ELT pipelines are like a plumbing system for your data warehouse. The traditional ETL process involved collecting volumes of event or transactional data in a staging area, then cleaning and standardizing the data, and finally loading it into the data warehouse. ELT is a newer process that relies on the capabilities of modern cloud data warehouses. They’re able to transform the data in place, once it’s inside the target system. Whichever process they use, most enterprises now rely on purpose-built tools to automate extraction and ensure that data flows are efficient and reliable.
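The traditional ETL flow described above can be sketched as three composable stages. This is a deliberately simplified illustration (the sources, validation rule and in-memory "warehouse table" are all assumptions), not a substitute for the purpose-built pipeline tools the text mentions.

```python
def extract(sources):
    """Pull raw records from each operational source into a staging list."""
    staged = []
    for name, fetch in sources.items():
        for record in fetch():
            staged.append({**record, "_source": name})
    return staged

def transform(staged):
    """Clean and standardize staged records before loading."""
    cleaned = []
    for rec in staged:
        if rec.get("amount") is None:  # drop records that fail basic validation
            continue
        cleaned.append({**rec, "amount": round(float(rec["amount"]), 2)})
    return cleaned

def load(records, warehouse_table):
    """Append cleaned records to the warehouse table (here, just a list)."""
    warehouse_table.extend(records)
    return len(records)

def run_pipeline(sources, warehouse_table):
    """ETL: extract, then transform in staging, then load the result."""
    return load(transform(extract(sources)), warehouse_table)
```

An ELT variant would simply reorder the last line, loading the raw staged records first and letting the warehouse engine run the transformation afterward.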
No. 4: Plan ahead for how you’ll define permissions and access controls.
When you’re aggregating data from a large number of sources, it’s critical to review the security needs of every one of them. The original sources might have enforced various types of field-level security: how will you maintain similar controls within the data warehouse? And, how can you ensure that data security best practices and compliance requirements will continue to be met? Think through your privilege management and access control strategy with care.
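As one way to picture field-level security carried over into the warehouse, the sketch below masks columns based on the caller's role before query results are returned. The roles, column names and policy table are hypothetical; real warehouses enforce this with built-in grants, row/column policies or masking rules rather than application code.

```python
# Hypothetical role-to-field policy: which warehouse columns each role may
# see in the clear. Anything not explicitly granted is masked.
FIELD_POLICY = {
    "analyst": {"claim_id", "amount_usd", "claim_date"},
    "compliance": {"claim_id", "amount_usd", "claim_date", "ssn"},
}

def apply_field_security(rows, role):
    """Mask any field the caller's role is not permitted to read."""
    allowed = FIELD_POLICY.get(role, set())
    return [
        {k: (v if k in allowed else "***REDACTED***") for k, v in row.items()}
        for row in rows
    ]
```

The useful property of a deny-by-default policy like this is that a newly added source field stays masked until someone consciously grants access to it.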
No. 5: Figure out how you’ll maintain observability into the data warehouse.
Data warehouses are inherently complex entities. They involve large numbers of external sources, and it’s possible for something to go wrong with any one of them. Establish a logging, monitoring and alerting infrastructure that surfaces failures as they happen. Observability is a critical concern for data warehouses: you need to be able to tell what’s failing, what happened, and what didn’t happen at any given time.
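A minimal sketch of that idea, using Python's standard `logging` module, is shown below. The `monitored_ingest` wrapper and the alert callback are illustrative assumptions; note that it alerts not only on failures but also on suspiciously empty loads, the "what didn't happen" case.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("warehouse")

def monitored_ingest(source_name, fetch, sink, alert):
    """Run one source's ingest, recording what happened -- and what didn't."""
    try:
        records = fetch()
        sink.extend(records)
        log.info("source=%s loaded=%d", source_name, len(records))
        if not records:
            # A silent, empty load is often a failure in disguise: alert on it.
            alert(f"{source_name}: load succeeded but returned zero records")
        return len(records)
    except Exception as exc:
        log.error("source=%s failed: %s", source_name, exc)
        alert(f"{source_name}: ingest failed ({exc})")
        raise
```

In production the `alert` callback would post to a paging or chat system, and the log lines would feed a monitoring stack rather than stdout.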
Today’s cloud data warehouse solutions bring all the benefits of the cloud (elasticity, scalability, high availability) to data warehousing. They’re secure, reliable and fast to query. Cloud offerings integrate seamlessly with a wide variety of commonly used business applications, provide easy self-service capabilities for users, and eliminate hardware costs and administrative burden. Most include integrated query engines and pipeline tools, and some can connect with data stored anywhere in the provider’s infrastructure.
For more information on how Cloudreach can help you prepare your business to harness the power of its data and become more data-led, click here.