Analyst Coverage: Philip Howard
Big data is a term for data that, for one reason or another, cannot easily be managed using traditional relational technology. This may be because there is simply too much data to process effectively and cost-efficiently; because some of the data is in a format (for example, text) that is not easily stored and manipulated in a standard relational database; because the data needs to be processed faster than conventional technology can cope with; or because of any combination of these factors.
Specifically, big data typically means instrumented or sensor-based data (sometimes called machine-generated data, which also includes such things as web logs) on the one hand, and text, video, audio and similar media types on the other. Both of these types of data have the potential to dwarf traditional data volumes. The Internet of Things is a specific example of an area where big data will, increasingly, be generated.
For the most part, the concepts behind big data are designed to allow you to analyse very large datasets of diverse data types in a cost-effective manner, which was not previously possible. In particular, you can create a so-called "data lake" that you can explore to obtain actionable insight. New self-service tools have recently emerged that allow business analysts (and data scientists) to explore these data lakes.
However, not all big data applications concern analytics and data warehousing. One example among many of big data applied in a different environment is the extended 360° view. This extends the traditional 360° view associated with master data management or customer relationship management: the core data is enriched with social media data, email-based information, call centre notes and so on, in order to gain a better understanding of the customer and thereby enable improved retention policies, or up-sell or cross-sell possibilities.
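To make the idea concrete, the sketch below shows what "enriching core data" might look like in the simplest possible terms: a master customer record is merged with records from social media and call centre channels, keyed on a customer identifier. All record structures, field names and the churn-risk rule are purely illustrative assumptions, not any vendor's implementation.

```python
# Illustrative sketch of an extended 360° view: merging a core master
# data record with data from non-traditional channels. All field names
# and the derived churn flag are hypothetical.

core_record = {
    "customer_id": "C-1001",
    "name": "Jane Doe",
    "segment": "retail",
}

# Data drawn from additional channels (social media, call centre notes).
social_mentions = [
    {"customer_id": "C-1001", "channel": "twitter", "sentiment": "negative"},
]
call_centre_notes = [
    {"customer_id": "C-1001", "note": "asked about cancelling contract"},
]

def extended_view(core, social, notes):
    """Merge core master data with channel data keyed on customer_id."""
    cid = core["customer_id"]
    view = dict(core)
    view["social"] = [m for m in social if m["customer_id"] == cid]
    view["notes"] = [n for n in notes if n["customer_id"] == cid]
    # A simple retention signal derived from the enriched view: flag the
    # customer as a churn risk if any social mention is negative.
    view["churn_risk"] = any(m["sentiment"] == "negative" for m in view["social"])
    return view
```

The point is not the (trivial) merge logic but that the enriched view supports decisions, such as a retention offer, that the core record alone could not.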
Different approaches to big data are required, depending on the issue. If the issue is simply one of scale, then the preferred solution is to use low-cost clustered hardware with an appropriate NoSQL database. Such products can also handle different data formats successfully, depending on the type of data to be stored and processed. However, relational database vendors are increasingly adding support for non-standard types of data (for example, JSON documents), so these products may be suitable in some instances. For certain types of processing where understanding relationships is important, graph or RDF databases may be used.
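The relationship-centric processing that graph databases excel at can be sketched in a few lines: entities become nodes, relationships become edges, and a query such as "everything connected to this customer within two hops" becomes a simple traversal. The data and node names below are invented for illustration; a real graph database adds indexing, persistence and a query language on top of this idea.

```python
from collections import deque

# Minimal sketch of the graph model behind relationship queries:
# nodes are entities (people, accounts, devices), edges are
# relationships, and analysis is traversal. Data is illustrative only.

edges = {
    "alice": ["acct_1", "bob"],
    "bob": ["acct_2"],
    "acct_1": ["device_9"],
    "acct_2": [],
    "device_9": [],
}

def within_hops(graph, start, max_hops):
    """Return all nodes reachable from start in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    seen.discard(start)
    return seen
```

In a relational database the same two-hop question requires self-joins whose cost grows with each hop, which is why relationship-heavy workloads tend to suit graph or RDF stores.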
Where very rapid processing of incoming data is required, a suitable NoSQL database may be used if the volumes are relatively low (tens of thousands of events per second). Where volumes run into hundreds of thousands or millions of events per second, however, specialised streaming analytics products will be required.
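The core idea that distinguishes streaming analytics from store-then-query processing can be illustrated with a tumbling-window aggregation: events are summarised over fixed time windows as they arrive, so only the aggregates need be kept. This is a toy sketch of the concept, with an assumed event format of (timestamp, payload); real streaming products handle ordering, late arrivals and distribution across machines.

```python
from collections import defaultdict

# Sketch of a tumbling-window aggregation, the basic building block of
# streaming analytics: count events per fixed window rather than
# storing every event for later querying. Event format is assumed.

def tumbling_counts(events, window_seconds):
    """events: iterable of (timestamp, payload) pairs.
    Returns {window_start: event_count} for tumbling windows."""
    counts = defaultdict(int)
    for ts, _payload in events:
        # Each event falls into exactly one non-overlapping window.
        window_start = ts - (ts % window_seconds)
        counts[window_start] += 1
    return dict(sorted(counts.items()))
```

Because each window's result can be emitted and discarded as soon as the window closes, memory use is bounded by the window size rather than by the event rate, which is what makes millions of events per second tractable.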
It is important to appreciate that big data needs to be governed (ensuring accuracy, compliance and so forth) in much the same way as other forms of data, in order to instil trust in any insights achieved.
Further resources to broaden your knowledge:
Managing data lakes: building a business case
This is a companion paper to one we published in 2017. We outline a methodology for building a business case in support of implementing suitable data lake management software.
SQL Engines on Hadoop
There are many SQL on Hadoop engines, but they are suited to different use cases: this report considers which engines are best for which sets of requirements.
Data Lake Management
Various measures are needed to prevent a data lake from becoming a swamp.
IBM Informix and the Internet of Things
This paper discusses the IBM Informix database and its suitability for deployment within Internet of Things (IoT) environments.
Total cost of ownership
TCO should be more important in decision making than either license fees or subscription costs.
Data Governance in the Big Data World
Data governance has seen a rise in adoption as organizations try to overcome data management complexities. Big data, with the sheer scale and variety of data it...
All things Hadoop
Discussing the Open Data Platform and Apache Spark
The Internet of Things Reference Model
The World Forum Architecture Committee has published an IoT reference model.
Product Information Management (PIM)
I often get emails from vendors talking about a whitepaper or other sales document. Sometimes these are very useful simple guides to a subject.
Extending a 360° view
In this paper we discuss why we believe that extending the traditional 360° view makes sense, and we give some uses that demonstrate why the extended view represents an opportunity.
Creating confidence in Big Data analytics
There has been some significant criticism of the concept of big data recently, notably in the Harvard Business Review criticising the Google Flu Trends...
Considering the small in big data
Not all of the issues addressed by big data need big data solutions.