Analyst Coverage: Philip Howard
Big data is used as a term to refer to data that, for one reason or another, cannot be easily managed using traditional relational technology. This is either because there is simply too much data to process effectively and in a cost-efficient manner, or because some of the data is in a format (for example, text) that is not easily manipulated and stored in a standard relational database, or because the data needs to be processed so fast that normal technology cannot cope, or because of any combination of these factors.
Specifically, big data typically means instrumented or sensor-based data (sometimes called machine generated data, which would also include such things as web logs) on the one hand, and text, video, audio and similar media types, on the other. Both of these types of data have the potential to dwarf traditional data volumes. The Internet of Things is a specific example of where big data will, increasingly, be generated.
For the most part, the concepts behind big data are designed to allow you to analyse very large datasets of diverse data types in a cost-effective manner, which was not previously possible. In particular, you can create a so-called “data lake” which you can explore to obtain actionable insight. New self-service tools have recently emerged that allow business analysts (and data scientists) to explore these data lakes.
However, not all big data applications concern analytics and data warehousing. A specific example of big data applied in a different environment (there are many others) is the extended 360° view. This is an extension to the traditional 360° view associated with master data management or customer relationship management, whereby the core data can be enriched with social media data, email-based information, call centre notes and so on, in order to get a better understanding of the customer and therefore to enable improved retention policies, or up-sell or cross-sell possibilities.
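As a rough illustration of the enrichment idea, the following Python sketch attaches unstructured interaction records to core customer records. It is a minimal sketch only: the record layouts, the field names (`customer_id`, `source`, `text`) and the function `build_extended_view` are hypothetical and invented for this example, and a real deployment would also need identity matching and governance steps that are not shown.

```python
from collections import defaultdict

# Hypothetical core master-data records, keyed by customer id.
core_customers = {
    "C001": {"name": "A. Example", "segment": "retail"},
    "C002": {"name": "B. Sample", "segment": "business"},
}

# Hypothetical interactions drawn from non-relational sources.
interactions = [
    {"customer_id": "C001", "source": "social_media", "text": "Great service!"},
    {"customer_id": "C001", "source": "call_centre", "text": "Asked about upgrade options."},
    {"customer_id": "C002", "source": "email", "text": "Complaint about billing."},
    {"customer_id": "C999", "source": "email", "text": "Unknown sender."},
]


def build_extended_view(core, events):
    """Return a copy of each core record enriched with its interactions.

    Interactions are grouped by source; events that do not match a known
    customer id are ignored in this sketch.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for event in events:
        grouped[event["customer_id"]][event["source"]].append(event["text"])

    extended = {}
    for customer_id, record in core.items():
        by_source = grouped.get(customer_id, {})
        extended[customer_id] = {
            **record,
            "interactions": {src: list(texts) for src, texts in by_source.items()},
        }
    return extended


extended_view = build_extended_view(core_customers, interactions)
```

Each entry in `extended_view` keeps the original core fields and gains an `interactions` dictionary keyed by source, which is the kind of enriched record that retention or cross-sell analysis could then draw on.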
Different approaches to big data are required, depending on the issue. If the issue is simply one of scale, then the preferred solution is to use low-cost clustered hardware with an appropriate NoSQL database. Such products can also handle different data formats successfully, depending on the type of data to be stored and processed. However, relational database vendors are increasingly adding support for non-standard types of data (for example, JSON documents), so these products may be suitable in some instances. For certain types of processing where understanding relationships is important, graph or RDF databases may be used.
Where very rapid processing of incoming data is required, a suitable NoSQL database may be used if the volumes are relatively low (tens of thousands of events per second). Where volumes run into hundreds of thousands or millions of events per second, however, specialised streaming analytics products will be required.
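The guidance on choosing an approach can be summarised as a simple decision rule. The Python sketch below is illustrative only: the function `suggest_technology` and the cut-off `STREAMING_THRESHOLD_EPS` are assumptions introduced here, since the text gives only orders of magnitude rather than an exact boundary, and the rule ignores the case where a relational database with JSON support would also be suitable.

```python
# Assumed boundary between "tens of thousands" and "hundreds of thousands"
# of events per second; the source text states no exact figure.
STREAMING_THRESHOLD_EPS = 100_000


def suggest_technology(events_per_second=0, relationship_heavy=False):
    """Suggest a broad technology category for a big data workload.

    events_per_second: expected rate of incoming events (0 if not a
        rapid-ingest workload).
    relationship_heavy: True when understanding relationships between
        entities is the main concern.
    """
    if events_per_second >= STREAMING_THRESHOLD_EPS:
        return "streaming analytics"
    if relationship_heavy:
        return "graph or RDF database"
    return "NoSQL database on clustered hardware"
```

For example, `suggest_technology(events_per_second=50_000)` falls below the assumed threshold and returns the NoSQL option, whereas a rate of two million events per second returns the streaming analytics option.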
It is important to appreciate that big data needs to be governed (ensuring accuracy, compliance and so forth) in much the same way that other forms of data are governed, in order to inculcate trust in any insights achieved.