Big Data

Big Data refers to the ability to analyse any type of data and not just the relational data that is usually analysed in data warehouses. This typically means instrumented or sensor-based data (sometimes called machine generated data) on the one hand and text, video, audio and similar media types on the other. Both of these types of data have the potential to dwarf the relational (transactional) data in terms of the quantities of such data that are generated and available for analysis: hence the term “big”.

Big data technologies are also noteworthy because they are often inexpensive. Many (not all) can be implemented across low-cost commodity servers, which makes the storage of large amounts of data a much more realistic proposition, from a financial point of view, than it was previously.

Big data technologies are essentially extensions to a data warehousing environment although there are some exceptions, notably where there are also operational (usually real-time) requirements. As such, big data provides exactly the same sorts of business intelligence and analytic functionality.

These extensions are usually implemented at the back end of the data processing environment, alongside the data warehouse or mart. Where very high volumes of data need to be processed in a very short time, however, the big data solution may be implemented before the data is stored in a data warehouse. Such solutions may use Complex Event Processing, also known as (event) stream processing, or a big data store (for example, one based on Cassandra). The former tends to be better when the model being processed is static and the latter when it is fluid.
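As a rough illustration of processing events before they are stored, the sketch below applies one fixed (static) rule to a stream of sensor readings: it raises an alert when the rolling mean of a sensor's most recent readings exceeds a threshold. The event shape, the function name `rolling_alerts` and the sample values are assumptions made for the example; it is not taken from any particular Complex Event Processing product.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (sensor_id, value)
Reading = Tuple[str, float]


def rolling_alerts(
    events: Iterable[Reading], window: int, threshold: float
) -> Iterator[Tuple[str, float]]:
    """Yield (sensor_id, rolling_mean) whenever the mean of the last
    `window` readings for a sensor exceeds `threshold`."""
    recent: Dict[str, Deque[float]] = {}  # sensor_id -> most recent values
    for sensor_id, value in events:
        buf = recent.setdefault(sensor_id, deque(maxlen=window))
        buf.append(value)
        # Only evaluate the rule once a full window is available.
        if len(buf) == window:
            mean = sum(buf) / window
            if mean > threshold:
                yield sensor_id, mean


# Illustrative data only.
stream = [
    ("pump-1", 10.0),
    ("pump-1", 30.0),
    ("pump-2", 5.0),
    ("pump-1", 50.0),
    ("pump-2", 6.0),
]
alerts = list(rolling_alerts(stream, window=2, threshold=25.0))
```

Because the function is a generator, each event is examined as it arrives and only the alerts, rather than every raw reading, need to be passed on for storage.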

Because of the low cost of many big data platforms, these may also be used for purposes other than business intelligence and analytics. For example, a number of companies are using Hadoop as a platform for ETL (extract, transform and load), while graph databases may be used for data quality matching and deduplication as well as for exploring relationships.
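One reason graph structures suit deduplication is that matching is transitive: if record A matches B and B matches C, all three probably describe the same entity. The sketch below groups records into such clusters using a simple union-find structure. It is an in-memory illustration under assumed inputs (the record identifiers and the list of pairwise matches are hypothetical), not the method of any particular graph database.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_matches(
    records: Iterable[str], matches: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Group records into clusters of transitively matched duplicates."""
    records = list(records)
    parent: Dict[str, str] = {r: r for r in records}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups: Dict[str, List[str]] = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical customer records and pairwise match results.
records = ["c1", "c2", "c3", "c4"]
matches = [("c1", "c2"), ("c2", "c3")]
clusters = cluster_matches(records, matches)
```

Here `c1`, `c2` and `c3` end up in one cluster even though `c1` and `c3` were never compared directly, while `c4` remains on its own.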

In general, the sorts of users who should care about big data are the same as those who care about data warehousing; that is, relevant managers and C-level executives who care about such things as:

  • Customer acquisition and retention
  • Customer up-sell and cross-sell
  • Supply chain optimisation
  • Fraud detection and prevention
  • Telco network analysis
  • Marketing optimisation

However, there are additional potential users in areas such as preventative maintenance, smart metering and other sensor-related activities. There is also a significant use of big data within web-based organisations such as online gaming, mobile applications and so on.

Hadoop, and its associated tools, is currently the ‘big beast’ of the big data world and the Hadoop environment is undergoing rapid development, especially in areas such as its robustness, manageability and SQL access (though there is not generally a database optimiser present), all of which are currently limited.

Gathering momentum are graph databases (essentially triple stores with an inference engine) and we expect these to grow in popularity as their ability to identify and parse relationships out to 6 or 7 degrees of separation is recognised (a typical relational database can manage about 3 degrees before performance dies). Graph databases, however, do not run on the low-cost clustered platforms that are otherwise typical of big data solutions, so these are not inexpensive in the same way that, say, Hadoop is.
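To make "degrees of separation" concrete, the sketch below walks a small relationship graph breadth-first and reports every entity reachable within a given number of hops. In a relational database each extra degree typically means another self-join on the relationship table, which is one reason performance tends to fall away as the depth grows. The graph, the names in it and the function `within_degrees` are assumptions for illustration only; this is an in-memory example, not a graph database query.

```python
from collections import deque
from typing import Dict, List


def within_degrees(
    graph: Dict[str, List[str]], start: str, max_degree: int
) -> Dict[str, int]:
    """Return every node reachable from `start` in at most `max_degree`
    hops, mapped to its number of hops (degrees of separation)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_degree:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the starting node is not its own relation
    return distance


# Hypothetical chain of acquaintances.
graph = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat", "eve"],
    "eve": ["dan"],
}
reachable = within_degrees(graph, "ann", 3)
```

With a limit of 3, `eve` (four hops from `ann`) is excluded; raising `max_degree` to 6 or 7 changes only the argument, not the shape of the traversal.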

Longer term, we expect (we know of two already) relational database vendors to implement HDFS (the file system used in Hadoop) as storage engines within their databases. This will combine the low-cost storage advantages of Hadoop with a single management layer that integrates the data warehouse and big data environments.

New vendors continue to enter the market and it is too early for any consolidation. Many, but not all, suppliers offer open source solutions and may have significant venture capital backing but little in the way of revenues. We do not believe that this can continue indefinitely: there are too many vendors and too many products, and it is reminiscent of the dot-com bubble. We would advise companies looking at investing in this market to be sure of their due diligence before licensing any particular product, especially if the solution to be adopted will be mission critical (which is often the case with sensor-based environments).

Notable recent announcements have been IBM’s new PureData platform based around GPFS (its version of HDFS) and the announcement by InterSystems that you can now use Globals (the Caché database without the development environment that normally comes with it) as a replacement for HDFS under Hadoop. Given how many alternatives there are to HDFS (Cassandra and RainStor to name just two more) there is going to be a major guessing game as to whether HDFS will survive and, if not, what will replace it.

Solutions

  • Cambridge Semantics
  • Grakn
  • Microsoft
  • Trendalyze
  • Unifi

These organisations are also known to offer solutions:

  • Cloudera
  • HortonWorks
  • IBM
  • Oracle
  • Pivotal
  • Starburst
  • Teradata

ArangoDB

ArangoDB is a multi-model database that supports document (JSON), key-value and property graph capabilities with one database core and one declarative query language.

Cambridge Semantics AnzoGraph

AnzoGraph is a massively parallel RDF database targeted primarily at large-scale analytic environments.

Neo4j

Neo4j is a labelled, property graph database with a native engine that is targeted at operational and hybrid operational/analytic use cases.

Microsoft Azure Cosmos DB

Cosmos DB is a distributed multi-model database that is provided as a service. It supports key-value, column store, document and property graphs.

Grakn Core and Grakn KGMS

Grakn is a graph-based platform for developing cognitive and other applications leveraging artificial intelligence.

IBM Cloud Private for Data

Limited, or no, technological capability with respect to AI is holding many companies back. This paper discusses how IBM ICP for Data can help.

Cambridge Intelligence KeyLines InBrief

KeyLines is a graph visualisation product that allows you to examine relationships between entities and/or events.

Data Catalogues

Data catalogues are hot. Why? Why should you care? What can they do for you?

Managing data lakes: building a business case

This is a companion paper to one we published in 2017. We outline a methodology for building a business case in support of implementing suitable data lake management software.

Trendalyze

Trendalyze describes its core capability as the discovery of motifs (and anomalies) within time series data. You can think of a motif as a micro-pattern but it is more accurately a shape. Once a motif of interest is discovered, or…

What’s Hot in Data

In this paper, we have identified the potential significance of a wide range of data-based technologies that impact on the move to a data-driven environment.

SQL Engines on Hadoop

There are many SQL on Hadoop engines, but they are suited to different use cases: this report considers which engines are best for which sets of requirements.