Analyst Coverage: Philip Howard
Big Data refers to the ability to analyse any type of data and not just the relational data that is usually analysed in data warehouses. This typically means instrumented or sensor-based data (sometimes called machine generated data) on the one hand and text, video, audio and similar media types on the other. Both of these types of data have the potential to dwarf the relational (transactional) data in terms of the quantities of such data that are generated and available for analysis: hence the term “big”.
Big data technologies are also noteworthy because they are often inexpensive. Many (not all) can be implemented across low-cost commodity servers, which makes the storage of large amounts of data a much more realistic proposition, from a financial point of view, than it was previously.
Big data technologies are essentially extensions to a data warehousing environment, although there are some exceptions, notably where there are also operational (usually real-time) requirements. As such, big data provides exactly the same sorts of business intelligence and analytic functionality as a data warehouse does.
These extensions are usually implemented at the back end of the data processing environment, alongside the data warehouse or mart. Where very high volumes of data need to be processed in a very short time, however, the big data solution may be implemented before the data is stored in a data warehouse. Such solutions may use Complex Event Processing, also known as (event) stream processing, or a big data platform (for example, one based on Cassandra). The former tends to be better when the model being processed is static and the latter when it is fluid.
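To make the idea of processing data before it is stored more concrete, here is a minimal sketch in Python of a static rule applied to a stream of sensor readings. It is an illustration only, not how any particular Complex Event Processing product works: the function name `detect_spikes`, the window size, the threshold and the sample readings are all assumptions made up for this example.

```python
from collections import deque


def detect_spikes(readings, window=3, threshold=10.0):
    """Yield (index, value) for each reading that exceeds the mean of the
    preceding `window` readings by more than `threshold`.

    The rule is fixed in advance (a "static model"), and each reading is
    examined as it arrives rather than after being loaded into a warehouse.
    """
    recent = deque(maxlen=window)  # holds only the last `window` readings
    for i, value in enumerate(readings):
        if len(recent) == window and value - sum(recent) / window > threshold:
            yield i, value
        recent.append(value)


# Hypothetical sensor feed; in practice this would be an unbounded stream.
sensor_feed = [20.0, 21.0, 19.0, 45.0, 22.0, 20.0]
alerts = list(detect_spikes(sensor_feed))
```

Only the readings that trigger the rule need to be acted on immediately; everything else can be stored, summarised or discarded afterwards.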
Because of the low cost of many big data platforms, these may also be used for purposes other than business intelligence and analytics. For example, a number of companies are using Hadoop as a platform for ETL (extract, transform and load), while graph databases may be used for data quality matching and deduplication as well as for exploring relationships.
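As a rough illustration of the kind of ETL work that might be pushed onto a Hadoop cluster, the following Python sketch mimics the map and reduce phases in a single process. It does not use Hadoop or any of its APIs; the record layout, the function names `map_phase` and `reduce_phase`, and the sample lines are assumptions invented for this example.

```python
from collections import defaultdict

# Hypothetical raw input: date, customer, amount (one malformed line included).
raw_lines = [
    "2024-01-01,alice,10.50",
    "2024-01-01,bob,3.00",
    "bad line",
    "2024-01-02,alice,4.50",
]


def map_phase(line):
    """Extract and transform: parse one line into (customer, amount) pairs,
    returning an empty list for lines that cannot be parsed."""
    parts = line.split(",")
    if len(parts) != 3:
        return []
    _date, customer, amount = parts
    try:
        return [(customer.strip(), float(amount))]
    except ValueError:
        return []


def reduce_phase(pairs):
    """Aggregate: sum the amounts per customer, ready for loading."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


pairs = [pair for line in raw_lines for pair in map_phase(line)]
totals = reduce_phase(pairs)
```

On a real cluster the map step would run in parallel across many nodes and the framework would group the pairs by key before the reduce step; the sketch only shows the shape of the transformation.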
In general, the sorts of users who should care about big data are the same as those who care about data warehousing; that is, relevant managers and C-level executives who care about such things as:
- Customer acquisition and retention
- Customer up-sell and cross-sell
- Supply chain optimisation
- Fraud detection and prevention
- Telco network analysis
- Marketing optimisation
However, there are additional potential users in areas such as preventative maintenance, smart metering and other sensor-related activities. There is also significant use of big data within web-based businesses such as online gaming and mobile applications.
Hadoop, together with its associated tools, is currently the ‘big beast’ of the big data world. The Hadoop environment is undergoing rapid development, especially in areas such as its robustness, manageability and SQL access (though a database optimiser is not generally present), all of which are currently limited.
Graph databases (essentially triple stores with an inference engine) are gathering momentum, and we expect these to grow in popularity as their ability to identify and parse relationships out to 6 or 7 degrees of separation is recognised (a typical relational database can manage about 3 degrees before performance dies). Graph databases, however, do not run on the low-cost clustered platforms that are otherwise typical of big data solutions, so they are not inexpensive in the same way that, say, Hadoop is.
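The notion of degrees of separation can be sketched with a breadth-first search over a handful of triples. This is a toy illustration in Python, not a graph database and not an inference engine: the triples, the predicate names and the function `degrees_of_separation` are assumptions made up for the example, and a real product would rely on its own indexes and query language.

```python
from collections import deque

# Hypothetical (subject, predicate, object) triples.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "works_with", "dave"),
    ("dave", "knows", "erin"),
]


def degrees_of_separation(triples, start, target, max_depth=7):
    """Return the number of hops between start and target, or None if the
    target is not reachable within max_depth hops. Relationships are
    treated as undirected for simplicity."""
    neighbours = {}
    for subject, _predicate, obj in triples:
        neighbours.setdefault(subject, set()).add(obj)
        neighbours.setdefault(obj, set()).add(subject)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


hops = degrees_of_separation(triples, "alice", "erin")
```

In a relational database each additional hop typically means another self-join on the relationship table, which is one way to see why performance degrades after a few degrees, whereas a traversal like this only follows the links it actually encounters.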
Longer term, we expect relational database vendors (we know of two already) to implement HDFS (the file system used in Hadoop) as a storage engine within their databases. This will combine the low-cost storage advantages of Hadoop with a single management layer that integrates the data warehouse and big data environments.
New vendors continue to enter the market and it is too early for any consolidation. Many, but not all, suppliers offer open source solutions and may have significant venture capital backing but little in the way of revenues. We do not believe that this can continue indefinitely: there are too many vendors and too many products, and the situation is reminiscent of the dot-com bubble. We would advise companies looking at investing in this market to be sure of their due diligence before licensing any particular product, especially if the solution to be adopted will be mission critical (which is often the case with sensor-based environments).
Notable recent announcements have been IBM’s new PureData platform based around GPFS (its version of HDFS) and the announcement by InterSystems that you can now use Globals (the Caché database without the development environment that normally comes with it) as a replacement for HDFS under Hadoop. Given how many alternatives there are to HDFS (Cassandra and RainStor, to name just two more), there is going to be a major guessing game as to whether HDFS will survive and, if not, what will replace it.