Big Data refers to the ability to analyse any type of data and not just the relational data that is typically analysed in data warehouses. This typically means instrumented or sensor-based data (sometimes called machine generated data) on the one hand and text, video, audio and similar media types on the other. Both of these types of data have the potential to dwarf the relational (transactional) data in terms of the quantities of such data that are generated and available for analysis: hence the term "big".
Big data technologies are also noteworthy because they are often inexpensive. Many (not all) can be implemented across low-cost commodity servers, which makes the storage of large amounts of data a much more realistic proposition, from a financial point of view, than it was previously.
Big data technologies are essentially extensions to a data warehousing environment, although there are some exceptions, notably where there are also operational (usually real-time) requirements. As such, big data provides exactly the same sorts of business intelligence and analytic functionality as the data warehouse does.
These extensions are usually implemented at the back end of the data processing environment, alongside the data warehouse or mart. Where very high volumes of data need to be processed in a very short time, however, the big data solution may be implemented before the data is stored in a data warehouse. These latter solutions may use Complex Event Processing, also known as (event) stream processing, or a big data store (for example, one based on Cassandra). The difference is that the former tends to be better when the model being processed is static and the latter when it is fluid.
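To make the idea of processing data before it is stored more concrete, the following is a minimal sketch of a static rule applied to a stream of sensor readings over a trailing time window. It is an illustration only, not the API of any particular Complex Event Processing product: the `Reading` type, the `detect_spikes` function, the 60-second window and the threshold of 100 are all assumptions made for the example.

```python
from collections import deque
from typing import Iterable, Iterator, NamedTuple, Tuple


class Reading(NamedTuple):
    """A hypothetical sensor event; the field names are assumptions."""
    timestamp: float  # seconds since some arbitrary start
    sensor_id: str
    value: float


def detect_spikes(
    events: Iterable[Reading],
    window_seconds: float = 60.0,
    threshold: float = 100.0,
) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, mean) whenever the mean value over the trailing
    window exceeds the threshold. Events are assumed to arrive in
    timestamp order."""
    window = deque()
    total = 0.0
    for event in events:
        window.append(event)
        total += event.value
        # Drop readings that have fallen out of the trailing window.
        while event.timestamp - window[0].timestamp > window_seconds:
            total -= window.popleft().value
        mean = total / len(window)
        if mean > threshold:
            yield (event.timestamp, mean)


readings = [
    Reading(0.0, "pump-1", 50.0),
    Reading(30.0, "pump-1", 120.0),
    Reading(90.0, "pump-1", 200.0),
    Reading(100.0, "pump-1", 10.0),
]
alerts = list(detect_spikes(readings))
```

The rule here (window length and threshold) is fixed in advance, which corresponds to the "static model" case described above. Only the alerts, or a filtered subset of the readings, would then need to be passed on to the warehouse.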
Because of the low cost of many big data platforms, these may also be used for purposes other than business intelligence and analytics. For example, a number of companies are using Hadoop as a platform for ETL (extract, transform and load), while graph databases may be used for data quality matching and deduplication as well as for exploring relationships.
In general, the sorts of users who should care about big data are the same as those who care about data warehousing; that is, relevant managers and C-level executives who care about such things as:
- Customer acquisition and retention
- Customer up-sell and cross-sell
- Supply chain optimisation
- Fraud detection and prevention
- Telco network analysis
- Marketing optimisation
However, there are additional potential users in areas such as preventative maintenance, smart metering and other sensor-related activities. There is also a significant use of big data within web-based organisations such as online gaming, mobile applications and so on.
Hadoop, together with its associated tools, is currently the 'big beast' of the big data world. The Hadoop environment is undergoing rapid development, especially in its robustness, manageability and SQL access (though there is generally no database optimiser present), all of which are currently limited.
Graph databases (essentially triple stores with an inference engine) are gathering momentum, and we expect these to grow in popularity as their ability to identify and parse relationships out to 6 or 7 degrees of separation is recognised (a typical relational database can manage about 3 degrees before performance dies). Graph databases, however, do not run on the low-cost clustered platforms that are otherwise typical of big data solutions, so they are not inexpensive in the same way that, say, Hadoop is.
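As an illustration of what "degrees of separation" means in practice, the sketch below runs a breadth-first search over a small in-memory adjacency list and returns every node reachable within a given number of hops. It is not how any graph database is implemented; the `within_degrees` function and the sample `contacts` data are assumptions made for the example. In a relational database the equivalent query typically needs one self-join per degree, which is one reason deeper traversals become expensive there.

```python
from collections import deque
from typing import Dict, Iterable


def within_degrees(
    graph: Dict[str, Iterable[str]], start: str, max_degrees: int
) -> Dict[str, int]:
    """Return {node: degrees of separation} for every node reachable
    from `start` in at most `max_degrees` hops (breadth-first search)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_degrees:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Hypothetical contact chain: ann - bob - cat - dan - eve - fay - gus - hal
names = ["ann", "bob", "cat", "dan", "eve", "fay", "gus", "hal"]
contacts = {name: [] for name in names}
for left, right in zip(names, names[1:]):
    contacts[left].append(right)
    contacts[right].append(left)

reachable = within_degrees(contacts, "ann", 6)
```

With the sample chain, `gus` is found at 6 degrees from `ann`, while `hal`, at 7 degrees, is excluded.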
Longer term, we expect relational database vendors (we know of two already) to implement HDFS (the file system used in Hadoop) as a storage engine within their databases. This will combine the low-cost storage advantages of Hadoop with a single management layer that integrates the data warehouse and big data environments.
New vendors continue to enter the market and it is too early for any consolidation. Many, but not all, suppliers offer open source solutions and may have significant venture capital backing but little in the way of revenues. We do not believe that this can continue indefinitely: there are too many vendors and too many products, and the situation is reminiscent of the dot-com bubble. We would advise companies looking at investing in this market to be sure of their due diligence before licensing any particular product, especially if the solution to be adopted will be mission critical (which is often the case with sensor-based environments).
Notable recent announcements have been IBM's new PureData platform based around GPFS (its version of HDFS) and the announcement by InterSystems that you can now use Globals (the Caché database without the development environment that normally comes with it) as a replacement for HDFS under Hadoop. Given how many alternatives there are to HDFS (Cassandra and RainStor, to name just two more), there is going to be a major guessing game as to whether HDFS will survive and, if not, what will replace it.