Big Data

Big Data refers to the ability to analyse any type of data and not just the relational data that is usually analysed in data warehouses. This typically means instrumented or sensor-based data (sometimes called machine generated data) on the one hand and text, video, audio and similar media types on the other. Both of these types of data have the potential to dwarf the relational (transactional) data in terms of the quantities of such data that are generated and available for analysis: hence the term “big”.

Big data technologies are also noteworthy because they are often inexpensive. Many (not all) can be implemented across low-cost commodity servers, which makes the storage of large amounts of data a much more realistic proposition, from a financial point of view, than it was previously.

Big data technologies are essentially extensions to a data warehousing environment although there are some exceptions, notably where there are also operational (usually real-time) requirements. As such, big data provides exactly the same sorts of business intelligence and analytic functionality.

These extensions are usually implemented at the back end of the data processing environment, alongside the data warehouse or mart. Where very high volumes of data need to be processed in a very short time, however, the big data solution may be implemented before the data is stored in a data warehouse. These latter solutions may use Complex Event Processing, also known as (event) stream processing, or a big data store (for example, one based on Cassandra). The difference is that the former tends to be better when the model being processed is static and the latter when it is fluid.
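
As a rough illustration of processing events before they are stored, the sketch below applies a fixed (static) rule to a stream of sensor readings and emits an alert whenever a rolling average exceeds a threshold. The function name `detect_spikes`, the window size and the threshold are all hypothetical choices made for this example. It is a minimal sketch of the general idea, not the API of any particular Complex Event Processing product.

```python
from collections import deque


def detect_spikes(readings, window=3, threshold=100.0):
    """Yield (index, average) whenever the rolling average of the last
    `window` readings exceeds `threshold`.

    The rule is fixed before the stream starts, which is the "static
    model" case; readings are inspected one at a time and never stored
    beyond the small rolling window.
    """
    recent = deque(maxlen=window)  # holds only the most recent readings
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            average = sum(recent) / window
            if average > threshold:
                yield index, average


# Hypothetical sensor readings arriving in order.
alerts = list(detect_spikes([90, 95, 100, 120, 130, 80], window=3, threshold=100.0))
for index, average in alerts:
    print(f"reading {index}: rolling average {average:.1f} exceeds threshold")
```

Because the generator keeps only the last few readings in memory, the same pattern can sit in front of a data warehouse and pass on only the alerts or aggregates that are worth storing.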

Because of the low cost of many big data platforms, these may also be used for purposes other than business intelligence and analytics. For example, a number of companies are using Hadoop as a platform for ETL (extract, transform and load), while graph databases may be used for data quality matching and deduplication as well as for exploring relationships.

In general, the sorts of users who should care about big data are the same as those who care about data warehousing; that is, relevant managers and C-level executives who care about such things as:

  • Customer acquisition and retention
  • Customer up-sell and cross-sell
  • Supply chain optimisation
  • Fraud detection and prevention
  • Telco network analysis
  • Marketing optimisation

However, there are additional potential users in areas such as preventative maintenance, smart metering and other sensor-related activities. There is also a significant use of big data within web-based organisations such as online gaming, mobile applications and so on.

Hadoop, and its associated tools, is currently the ‘big beast’ of the big data world and the Hadoop environment is undergoing rapid development, especially in areas such as its robustness, manageability and SQL access (though there is not generally a database optimiser present), all of which are currently limited.

Graph databases (essentially triple stores with an inference engine) are gathering momentum, and we expect these to grow in popularity as their ability to identify and parse relationships out to 6 or 7 degrees of separation is recognised (a typical relational database can manage about 3 degrees before performance dies). Graph databases, however, do not run on the low-cost clustered platforms that are otherwise typical of big data solutions, so they are not inexpensive in the same way that, say, Hadoop is.
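
To make "degrees of separation" concrete, the sketch below walks a small in-memory graph breadth-first and reports every node reachable within a given number of hops. The graph, the names in it and the function `within_degrees` are hypothetical and invented for illustration; real graph databases use their own storage and query languages. In a relational schema, each additional degree typically needs a further self-join or recursion step over a relationship table.

```python
from collections import deque


def within_degrees(graph, start, max_degrees):
    """Return {node: degrees} for every node reachable from `start`
    in at most `max_degrees` hops, using breadth-first search."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_degrees:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[start]  # the start node is zero degrees from itself
    return distances


# Hypothetical "knows" relationships stored as an adjacency list.
graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol", "erin"],
    "erin": ["dave"],
}

reachable = within_degrees(graph, "alice", 3)
print(reachable)
```

Each node is visited at most once, so the cost grows with the size of the neighbourhood explored and not with the number of joins.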

Longer term, we expect relational database vendors (we know of two already) to implement HDFS (the file system used in Hadoop) as a storage engine within their databases. This will combine the low-cost storage advantages of Hadoop with a single management layer that integrates the data warehouse and big data environments.

New vendors continue to enter the market and it is too early for any consolidation. Many, but not all, suppliers offer open source solutions and may have significant venture capital backing but little in the way of revenues. We do not believe that this can continue indefinitely: there are too many vendors and too many products, and the situation is reminiscent of the dot-com bubble. We would advise companies looking at investing in this market to be sure of their due diligence before licensing any particular product, especially if the solution to be adopted will be mission critical (which is often the case with sensor-based environments).

Notable recent announcements have been IBM’s new PureData platform based around GPFS (its version of HDFS) and the announcement by InterSystems that you can now use Globals (the Caché database without the development environment that normally comes with it) as a replacement for HDFS under Hadoop. Given how many alternatives there are to HDFS (Cassandra and RainStor, to name just two more), there is going to be a major guessing game as to whether HDFS will survive and, if not, what will replace it.

Solutions

  • Actian
  • AWS
  • Ataccama
  • Cambridge Semantics
  • Cazena
  • Cloudera
  • Crate.io
  • DataStax
  • Exasol
  • Fauna
  • Franz Inc
  • Grakn
  • Greenplum
  • Hitachi
  • IBM
  • InfluxData
  • Informatica
  • Interana
  • KX
  • MarkLogic
  • McObject
  • Memgraph
  • Microsoft
  • N5
  • Neo4j
  • Objectivity
  • Ontotext
  • Oracle
  • Qlik
  • Quasar DB
  • Redis Labs
  • Scylla
  • SingleStore
  • Software AG
  • Solix
  • Sparsity
  • Starburst
  • Stardog
  • Talend
  • Teradata
  • TIBCO
  • TigerGraph
  • Timescale
  • Trendalyze
  • Unifi
  • Vertica
  • VictoriaMetrics
  • Yellowbrick

These organisations are also known to offer solutions:

  • Databricks
  • Esgyn
  • HortonWorks
  • Kognitio
  • Pivotal
  • Precisely
  • SAP
  • Snowflake

Stardog (2020)

Stardog is an RDF database with strong support for SPARQL and OWL that can be extended to provide labelled property graph capabilities.

Trendalyze (February 2020)

Trendalyze describes its core capability as the discovery of motifs (micro-trends) and anomalies within time series data.

Neo4j (2020)

Neo4j is a property graph database with a native engine that is targeted at operational, hybrid operational/analytic (HTAP) and pure analytic use cases.

Graph Database Market Update 2019

This is the third Market Update into the graph database market, considering and comparing both property graph and RDF databases.

Big data and the mainframe - issues and opportunities

The purpose of this paper is to examine the issues that arise when big data implementations transition beyond skunk works and into general-purpose use.

TrendMiner, a Software AG company

TrendMiner is a self-service analytics solution designed for domain experts within the process manufacturing space.

Teradata Vantage (July 2020)

Teradata Vantage effectively consists of a merger between what was previously simply Teradata Database, and Aster Analytics.

IBM Db2 Event Store

IBM Db2 Event Store is an in-memory database built on top of Apache Spark, intended to support both near real-time and deep analytics on historic data.