A data warehouse is a database implementation that supports the storage and analysis of historic data, either for the purpose of exploring what has happened in the past in order to understand that past, or as the basis for predicting what may happen in the future.
Whereas a database for transaction processing is essentially a write many times and read many times environment, a data warehouse is a write-once read-many environment. There is therefore a particular emphasis on read performance. Data warehouses also differ from transactional databases in that they typically store much larger amounts of data and therefore scalability is a much bigger issue.
In practice, there are different types of data warehouse implementation. An Enterprise Data Warehouse has historically been seen as the central repository of all relevant data required for analysis purposes; a data warehouse (with or without the tag "enterprise") supports operational queries (short look-up queries from such things as call centres) as well as business intelligence and analytics; and a Data Mart is used for analysis within a particular domain or department. The widespread deployment of data marts has historically meant that the control implied by the use of an Enterprise Data Warehouse has been difficult to attain.
Data warehouses and marts support the use of business intelligence, analytic, statistical and reporting tools that are either used to examine what has happened in the past or (increasingly) predict what is going to happen in the future. Often, a data warehouse will be designed to do both of these. In addition, an Enterprise Data Warehouse—but not a Data Mart—may support call centre and similar operatives who need to look-up customer information on a regular basis. This range of requirements puts an onus on the database to include features that will allow all of these different functions to operate in an efficient manner.
Because of the complexity involved, some suppliers in this space do not target the whole range of data warehousing requirements but specialise in particular subsets thereof. In particular, some vendors focus on data marts only and, in particular, on supporting complex analytics and statistics that go beyond normal run of the mill business intelligence environments.
Anyone who wants to understand what has been happening in order to inform future strategy, to predict future trends or actions or those interested in detecting and/or preventing nefarious activity of various kinds. Relevant managers and C level executives in any of the following cross-industry areas (amongst many others) should be interested:
- Customer acquisition and retention
- Customer up-sell and cross-sell
- Supply chain optimisation
- Fraud detection and prevention
- Telco network analysis
- Marketing optimisation
Apart from big data the major trend in the marketplace is away from the concept of a single Enterprise Data Warehouse that stores a 'golden copy of the truth', which is surrounded by data marts that are linked back to the Enterprise Data Warehouse. While fine in theory it is now increasingly recognised that this is impractical in practice, at least in organisations of any size. The concept has therefore arisen of the "logical data warehouse". This is essentially an Enterprise Data Warehouse where the data is not in a single place but is distributed across multiple (heterogeneous) systems, including both data warehouses and data marts.
Enabling the logical data warehouse requires an understanding of where data is located throughout the environment and the ability to query across data repositories. Data virtualisation provides this capability. In addition, some data may need to be replicated across databases and there will also be a requirement to synchronise data where that is the case. Ideally, one would also like a tool that told you the best place to store particular data elements. There is no single suite of tools that will currently enable a logical data warehouse. It is likely that we will see support for homogeneous logical data warehouses before heterogeneous support.
The second significant trend within the data warehousing market is the use of in-memory and flash or solid state disks to support the functioning of the underlying database. While such facilities will improve performance they do not generally impact on the capabilities of the solution per se.
Over the last couple of years there has been a significant consolidation of the market: IBM acquired Netezza, Microsoft acquired DATAllegro, HP dropped NeoView and bought Vertica, Teradata is now the owner of Aster Data, and EMC acquired Greenplum (now Pivotal). It remains to be seen how successful HP and Pivotal are in this space, given that neither of them has a proven track record of selling software.
IBM is distinguishing between its traditional offerings and Netezza by marketing the former as for operational analytics (that is, as a traditional data warehouse) and the latter for analytics (that is, as a data mart). Meanwhile, Oracle has released Exadata X3 and Oracle 12c is now available.
Almost all vendors now offer massively parallel solutions and those that don't (notably Infobright and Actian) are on the verge of announcing relevant products. One vendor, illuminate, appears to have gone out of business. Amazon has recently announced its cloud-based warehousing offering, which is based on ParAccel (now acquired by Actian) technology. This acquisition of ParAccel must raise major doubts over its future, given that Vectorwise (the Actian data warehousing platform) and ParAccel were previously competitors.
Needless to say, all the vendors have jumped onto the big data bandwagon in one way or another and while big data solutions may be separate from data warehousing they will also often be complementary. Thus the ability to integrate closely with relevant big data solutions will be important.