skip to Main Content

Data Warehousing

Last Updated:
Analyst Coverage: and

A data warehouse is a database implementation that supports the storage and analysis of historic data, either for the purpose of exploring what has happened in the past in order to understand that past, or as the basis for predicting what may happen in the future. Whereas a database for transaction processing is essentially a write many times and read many times environment, a data warehouse is a write-once read-many environment. There is therefore a particular emphasis on read performance. Data warehouses also differ from transactional databases in that they typically store much larger amounts of data and therefore scalability is a much bigger issue. That said, so-called HTAP databases merge the capabilities of operational and analytic databases into a single environment – see operational databases.

In practice, there are different types of data warehouse implementation. Most organisations will have a so-called logical data warehouse where a central enterprise data warehouse (EDW) provides a “single source of truth” for important corporate data but data is otherwise distributed across multiple data marts and data lakes, with query processing enabled through data virtualisation. In addition, some data may need to be replicated – see data movement – across databases and there will also be a requirement to synchronise data where that is the case. Ideally, one would also like a tool that told you the best place to store particular data elements.

While data marts are primarily focused on structured data, data lakes provide the ability to store, investigate and analyse data in a wide variety of formats. The latter are typically supported by NoSQL databases or specialised environments such as graph databases.

The difference has historically been seen as the central repository of all relevant data required for analysis purposes; a data warehouse (with or without the tag “enterprise”) supports operational queries (short look-up queries from such things as call centres) as well as business intelligence and analytics; and a Data Mart is used for analysis within a particular domain or department. Complementary to both of these are Data Lakes and specialised repositories such as graph databases and other NoSQL deployments.

Data warehouses, marts and lakes support the use of business intelligence, analytic, statistical and reporting tools that are either used to examine what has happened in the past or (increasingly) predict what is going to happen in the future. Often, a data warehouse will be designed to do both of these. In addition, an Enterprise Data Warehouse may support call centre and similar operatives who need to look-up customer information on a regular basis. This range of requirements puts an onus on the database to include features – known as workload management or, in some cases, workload isolation – that will allow all of these different functions to operate in an efficient manner.

Because of the complexity involved, some suppliers in this space do not target the whole range of data warehousing requirements but specialise in particular subsets thereof. In particular, some vendors focus on data lakes only and, in particular, on supporting complex analytics and statistics that go beyond normal run of the mill business intelligence environments.

Anyone who wants to understand what has been happening in order to inform future strategy, to predict future trends or actions or those interested in detecting and/or preventing nefarious activity of various kinds. Relevant managers and C level executives in any of the following cross-industry areas (amongst many others) should be interested:

  • Customer acquisition and retention
  • Customer up-sell and cross-sell
  • Supply chain optimisation
  • Fraud detection and prevention
  • Telco network analysis
  • Marketing optimisation

Apart from the rise of data lakes and NoSQL databases the biggest trend within this space is towards cloud-based deployments. A variety of offerings are available, whether for private, public or hybrid cloud environments. In some cases, managed services are offered.

Another major trend is the separation of compute from storage, the idea being that neither should be a bottleneck but that you shouldn’t have to pay for extra storage if all you need is more processing power, or vice versa. While a number of vendors – both traditional data warehouse vendors and NoSQL database providers – have introduced such capabilities there are still many that have not.

Finally, moving to a container-based architecture, typically within the context of Kubernetes, is a further significant move that many vendors are taking.

The data warehousing market periodically suffers from disruption and (cloud-based) solutions from companies such as Snowflake, Cazena and Incorta are doing just that. The first of these, in particular, is gaining significant traction though mostly with smaller companies. It remains to be seen how much impact these will have on the major players and it is interesting to note that we are hearing less about Amazon RedShift.

On a different note, Vertica is now part of Micro Focus, thanks to the acquisition of HPE Software and, of course, HortonWorks and Cloudera have “merged”: a combined platform is due for release in summer 2019.

Solutions

  • Actian logo
  • Attunity (logo)
  • biGENiUS (logo)
  • CAZENA logo
  • CLOUDERA logo
  • EXASOL (logo)
  • PIVOTAL GREENPLUM logo
  • IBM (logo)
  • Infoworks (logo)
  • Kx Systems (logo)
  • Magnitude Software (logo)
  • STARBURST logo
  • teradata logo
  • TimeXtender (logo)
  • VERTICA logo
  • WhereScape (logo)
  • YELLOWBRICK logo

These organisations are also known to offer solutions:

  • 1010Data
  • Amazon
  • BIReady
  • Broadcom
  • Calpont
  • Databricks
  • Dataupia
  • Incorta
  • Infobright
  • Kalido
  • Kognitio
  • Micro Focus
  • Microsoft
  • Oracle
  • Pivotal
  • Sand Technologies
  • Snowflake
  • Tokutek
  • VectorNova
  • XtremeData
The cover of SQL Engines on Hadoop

SQL Engines on Hadoop

There are many SQL on Hadoop engines, but they are suited to different use cases: this report considers which engines are best for which sets of requirements.
Cover for IBM Private Cloud (InDetail)

IBM Cloud Pak for Data 1.2

Limited, or no, technological capability with respect to AI is holding many companies back. This paper discusses how IBM Cloud Pak for Data can help.
IBM InBrief cover thumbnail

IBM Db2 Event Store

IBM Db2 Event Store is an in-memory database built on top of Apache Spark, intended to support both near real-time and deep analytics on historic data.
Cover for the Attunity Compose InBrief

Attunity Compose

Attunity Compose is a data warehouse automation solution that can automatically create analytics-ready data structures on both data warehouses and data lakes.
GREENPLUM InBrief cover thumbnail

Greenplum Database

Greenplum is a massively parallel shared-nothing data warehouse based on a PostgreSQL kernel.
00002532 - DIMENSIONAL ANALYTICS Spotlight cover thumbnail

Dimensional Analytics

This paper explores “dimensional analytics”, by which we mean analytics that requires an understanding of the dimensions of time and space.
STARBURST InBrief cover thumbnail

Starburst Presto

Starburst Data provides commercial support for Apache Presto as well as Starburst Enterprise.
Cover for TCO for Business Intelligence

TCO for Business Intelligence

This paper initially started as an investigation into comparative pricing for BI solutions and then evolved into a consideration of TCO for such environments.
Back To Top