Another look at big data

In the previous article in this series I discussed what big data is and talked about the ability to query ALL data that is relevant to the organisation. There are several ways to look at this issue. In that previous article I focused on where the data comes from: transactional, content, instrumented and external. However, there are other viewpoints to be considered as well, notably the type of data involved and the type of query capabilities that are available. These are closely linked.

There are various forms of structured data. Transactional data is structured and you can access it via conventional query tools, analytics and SQL. XML-based documents are also structured and can be accessed via XQuery, and several vendors offer extended SQL capabilities that allow XML documents to be queried alongside transactional data. Unfortunately, neither of these approaches is much use with unstructured data.
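As a purely illustrative sketch (the table, document and element names are all invented for the example), the snippet below queries relational rows with conventional SQL and an XML document by element path, side by side:

    import sqlite3
    import xml.etree.ElementTree as ET

    # Transactional data: conventional SQL over rows and columns.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "Acme", 250.0), (2, "Globex", 99.5)])
    for customer, total in db.execute(
            "SELECT customer, total FROM orders WHERE total > 100"):
        print(customer, total)

    # An XML document: also structured, but navigated by element path
    # (XQuery-style) rather than by table and column.
    doc = ET.fromstring(
        "<invoice><customer>Acme</customer><line total='250.0'/></invoice>")
    print(doc.findtext("customer"), doc.find("line").get("total"))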

There are various other forms of structured data. For example, the data in a spreadsheet is structured: it’s just that there is no metadata to describe what the rows and columns mean. Other examples are sensor-based, clickstream and log data. All of these are essentially structured, but they are not really relational: they don’t typically have the primary and foreign key relationships that characterise relational data. For this reason something like Hadoop is well suited to storing this sort of information, simply because you don’t need the complexity (and expense) of a full relational database. Historically, though, either relational databases and relational access methods or flat-file systems have been used with these sorts of data.
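A web-server access log illustrates the point: each entry is rigidly structured even though no database schema or key relationships are involved. A few lines of code (an illustrative sketch, not tied to any product) recover the fields:

    import re

    # A common-log-format entry: structured, but with no schema and no keys
    # relating it to other records -- the structure lives in the format itself.
    LOG_PATTERN = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
    )

    line = '203.0.113.7 - - [12/Dec/2011:10:14:32 +0000] ' \
           '"GET /product/42 HTTP/1.1" 200 5120'
    match = LOG_PATTERN.match(line)
    if match:
        record = match.groupdict()
        print(record["host"], record["path"], record["status"])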

Instrumented data is also frequently time-stamped. This creates two distinct requirements: the ability to store time series data and the ability to analyse it, and the former is clearly an advantage when it comes to supporting the latter. Nevertheless, while a number of data warehousing products support time-series analysis, very few relational databases (or databases of any other type) store data this way. One of the few exceptions is Informix, which has supported time series since it acquired Illustra back in the 90s.
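By way of illustration only (no particular product’s API is shown), time-stamped readings held in time order make analyses such as a moving average straightforward:

    from datetime import datetime, timedelta

    # Time-stamped sensor readings, stored in time order -- the property
    # that makes windowed analyses like moving averages cheap to compute.
    readings = [
        (datetime(2011, 12, 1, 9, 0) + timedelta(minutes=m), 20.0 + m * 0.1)
        for m in range(10)
    ]

    def moving_average(series, window=3):
        """Average each reading with the (window - 1) readings before it."""
        values = [v for _, v in series]
        return [
            (series[i][0],
             sum(values[max(0, i - window + 1): i + 1]) / min(i + 1, window))
            for i in range(len(series))
        ]

    for ts, avg in moving_average(readings):
        print(ts.isoformat(), round(avg, 2))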

The other type of data is unstructured. Search is the historic way of querying this or, if you have built a suitable taxonomy, you can do a more thorough analysis of content. However, if you want to analyse tweets, for example, it is unlikely that you will have such a taxonomy, in which case you will really need to parse the data. Consider a 140-character product description: data quality tools can parse this for things like colour, model number, number of pixels, horsepower, voltage, dimensions and other characteristics of relevant products. In other words, you can extract structured information from the text. You would really like to be able to do the same with tweets and, as it happens, Informatica has just released HParser, which is a parser for Hadoop. While this is the first such product (as far as I know), it surely won’t be the last.
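The sketch below illustrates the general idea only; it is not how HParser works, and the attribute patterns are invented for the example:

    import re

    # Hypothetical attribute patterns, purely to illustrate extracting
    # structured fields from short free text.
    PATTERNS = {
        "colour": re.compile(r"\b(black|white|red|blue|silver)\b", re.I),
        "model": re.compile(r"\bmodel\s+([A-Z0-9-]+)", re.I),
        "megapixels": re.compile(r"(\d+(?:\.\d+)?)\s*(?:MP|megapixels?)", re.I),
        "voltage": re.compile(r"(\d+)\s*V\b"),
    }

    def parse_description(text):
        """Pull whichever attributes the patterns can find out of free text."""
        found = {}
        for name, pattern in PATTERNS.items():
            match = pattern.search(text)
            if match:
                found[name] = match.group(1)
        return found

    print(parse_description("Silver model DSC-HX9 camera, 16.2 MP, charger 5V"))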

Leaving that aside, traditional products aren’t very good at querying text even when they have text indexing, as products such as Sybase IQ do, and this is precisely where Hadoop comes into play.
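Hadoop’s strength here is brute-force scanning via MapReduce. The canonical example is counting terms across a corpus of documents, sketched below in plain, in-process Python rather than real Hadoop code:

    from collections import Counter
    from itertools import chain

    # The MapReduce idea in miniature: map each document to (term, 1) pairs,
    # then reduce by summing counts per term. On Hadoop this work is spread
    # across a cluster; here it runs in one process purely for illustration.
    documents = [
        "big data is not just big",
        "hadoop stores unstructured data",
    ]

    def map_phase(doc):
        return ((term, 1) for term in doc.split())

    pairs = chain.from_iterable(map_phase(d) for d in documents)

    counts = Counter()
    for term, n in pairs:  # the reduce phase: sum per key
        counts[term] += n

    print(counts.most_common(3))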

Finally, it isn’t just a question of structured and unstructured data: frequently it will make sense to combine the two. There are several ways of doing this. One is to use a business intelligence tool that takes an index-based approach to both structured and unstructured data; examples are Endeca (now Oracle) Latitude and Connexica, which run on top of a standard relational data warehouse. The second, largely theoretical, possibility is to put all the data into Hadoop, though you probably wouldn’t want to be without your data warehouse. The third option is to use a data warehouse that directly supports MapReduce, such as Aster Data (now Teradata); a fourth is to implement HBase (a column-oriented store) on top of Hadoop; and the fifth is to link your data warehouse to Hadoop via federation (companies like Denodo and Composite Software support this) or via ETL processes (IBM, Informatica, Talend, Syncsort et al). Most companies are opting for this last option, an approach I will discuss further in a forthcoming article.
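To make that combination concrete, here is an illustrative sketch (all names and figures are invented) of the ETL-style option: facts extracted from text on the Hadoop side are loaded into the warehouse and joined to relational product records:

    import sqlite3

    # Warehouse side: conventional relational product data.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE products (model TEXT PRIMARY KEY, list_price REAL)")
    db.executemany("INSERT INTO products VALUES (?, ?)",
                   [("DSC-HX9", 329.0), ("XL-200", 119.0)])

    # Hadoop side: records extracted from free text (e.g. by a parsing job),
    # delivered here as plain dictionaries -- the ETL step in miniature.
    extracted = [
        {"model": "DSC-HX9", "mentions": 42, "sentiment": 0.7},
        {"model": "XL-200", "mentions": 3, "sentiment": -0.2},
    ]

    db.execute(
        "CREATE TABLE text_facts (model TEXT, mentions INTEGER, sentiment REAL)")
    db.executemany(
        "INSERT INTO text_facts VALUES (:model, :mentions, :sentiment)", extracted)

    # The payoff: structured and text-derived data queried together.
    query = """SELECT p.model, p.list_price, t.mentions, t.sentiment
               FROM products p JOIN text_facts t ON p.model = t.model
               ORDER BY t.mentions DESC"""
    for row in db.execute(query):
        print(row)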