So, what is data virtualisation?

Data virtualisation is the latest technology to enjoy its moment in the hypelight, and there has been considerable debate within the blogosphere about what it actually is and how it relates to data federation, data integration and EII (enterprise information integration).

Rather than start from scratch I thought I would go back through my files and see what I had written about this in the past (if anything). I found the following definition of an EII platform (that is, what you need to support EII, which is, after all, about information rather than mere data). What I wrote, some three years ago, was that an EII platform needs to do four things:

“It virtualises your data – it makes all relevant data sources, including databases, application environments and other places where data may be sourced, appear as if they were in one place so that you can access that data as such.
“It abstracts your data – that is to say, it conforms your data so that it is in a consistent format regardless of any native structure and syntax that may be in use in the underlying data sources.
“It federates the data – it provides the connectivity that allows you to pull data together from diverse, heterogeneous sources (which may contain either operational or historical data or both) so that it can be virtualised. It should also enable things like push-down optimisation so that query joins can be performed in the optimal place.
“It presents the data in a consistent format to the front-end application (typically, but not always, a BI tool) either through relational views (via SQL) or by means of web/data services, or both.”
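To make those four functions a little more concrete, here is a minimal sketch in Python. It is not how any particular EII product works: the two sources (an in-memory SQLite table and a CSV string), the field names and the `customer_totals` view are all hypothetical, invented purely for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational database (in-memory SQLite here).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (cust_id INTEGER, full_name TEXT)")
crm.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada Lovelace"), (2, "Alan Turing")],
)

# Hypothetical source 2: a flat file with different field names and string values.
ORDERS_CSV = "CustomerID,Amount\n1,120.50\n2,75.00\n1,30.25\n"


def read_customers():
    """Abstraction: map the database's native columns onto a common record shape."""
    query = "SELECT cust_id, full_name FROM customers ORDER BY cust_id"
    for cust_id, full_name in crm.execute(query):
        yield {"customer_id": cust_id, "name": full_name}


def read_orders():
    """Abstraction: conform the CSV's field names and types to the same shape."""
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        yield {"customer_id": int(row["CustomerID"]), "amount": float(row["Amount"])}


def customer_totals():
    """Virtualisation and presentation: one logical view over both sources.

    The data stays where it is; the (very naive) federation step is the
    join performed here, in the middle tier, at query time.
    """
    totals = {}
    for order in read_orders():
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return [
        {**customer, "total": totals.get(customer["customer_id"], 0.0)}
        for customer in read_customers()
    ]
```

A consumer calls `customer_totals()` and gets consistent records back without knowing that one source is a database and the other a file. What the sketch leaves out is exactly what a real platform adds: a query optimiser, and presentation through SQL views or web/data services rather than a Python function.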

Actually, I didn’t quite write that: I have updated it somewhat but the gist is the same.

Clearly, data federation is not the same as data virtualisation. Moreover, federation is not necessary for virtualisation, depending on why you are doing the virtualisation. If you want to link a number of data marts together so that you can query across them then clearly the query optimisation capabilities of a federation engine will be necessary. On the other hand, if you want to create mashups or other applications that have relatively lightweight access requirements, or you want to use virtualisation to support MDM-like capabilities, then such optimisation capabilities may not be necessary. Instead you can use data services. Data services may also be more appropriate in environments where less of the data is relational and more of it comes from a variety of unstructured sources or from the web. Indeed, there is a whole new discussion to be had about the distinctions between data virtualisation for unstructured data and structured data (or a combination of the two) but that’s a subject for another day.
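For anyone unsure why query optimisation matters when querying across data marts, the following toy sketch contrasts pushing a filter down to the source with filtering in the middle tier. The two marts, their tables and the `sales_by_region` function are assumptions made up for this example; a real federation engine makes such decisions with a cost-based distributed optimiser rather than a boolean flag.

```python
import sqlite3


def make_mart(ddl, insert_sql, rows):
    """Create a hypothetical data mart as an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


sales_mart = make_mart(
    "CREATE TABLE sales (customer_id INTEGER, year INTEGER, amount REAL)",
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 2009, 100.0), (1, 2010, 50.0), (2, 2010, 70.0), (3, 2008, 20.0)],
)
customer_mart = make_mart(
    "CREATE TABLE customers (customer_id INTEGER, region TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
)


def sales_by_region(year, push_down):
    """Return (total sales per region, number of sales rows fetched)."""
    if push_down:
        # The predicate travels to the source, so only matching rows come back.
        sales = sales_mart.execute(
            "SELECT customer_id, amount FROM sales WHERE year = ?", (year,)
        ).fetchall()
        fetched = len(sales)
    else:
        # Everything is fetched and then filtered in the middle tier.
        all_rows = sales_mart.execute(
            "SELECT customer_id, year, amount FROM sales"
        ).fetchall()
        fetched = len(all_rows)
        sales = [(cust, amount) for cust, row_year, amount in all_rows if row_year == year]

    # The cross-mart join itself still happens in the middle tier.
    regions = dict(
        customer_mart.execute("SELECT customer_id, region FROM customers").fetchall()
    )
    totals = {}
    for cust, amount in sales:
        region = regions[cust]
        totals[region] = totals.get(region, 0.0) + amount
    return totals, fetched
```

Both calls, `sales_by_region(2010, push_down=True)` and `sales_by_region(2010, push_down=False)`, give the same totals, but the first pulls two rows from the sales mart and the second pulls all four. With four rows that is irrelevant; with large marts it is the difference a federation engine exists to manage, and the kind of work a lightweight data service can often do without.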

The other question that arises is whether parts 1, 2 and 4 (virtualisation, abstraction and presentation) are all actually parts of the same thing. I think 2 and 4 probably are or, at least, the differences are so slight that there is no point in making a distinction.

Parts 1 and 2 (virtualisation and abstraction) are another issue. If data virtualisation is about having a virtual data source, that does not necessarily mean that the source is easy to work with. It is certainly easy to imagine a huge hybrid database that contains relational and non-relational data, PDF documents and a whole bunch of other things, but that would not necessarily mean that the data was all in a common format and, therefore, easy to work with. So, I think both 1 and 2 are required and are different. It is certainly true that it does not make much sense to implement data virtualisation without an abstraction layer but that doesn’t mean they are the same thing.

Finally, I haven’t talked about data integration at all. Well, the fact is that leading data integration products support data services so you should certainly be able to virtualise data sources even if you can’t federate them (they won’t typically have the sort of distributed query optimiser you would want from a data federation product). The question will be how easy it is to build the abstraction layer with a data integration tool. Of course, you can create all the transformations and mappings necessary for this purpose but what you would really like is something that automates a lot of this abstraction rather than requiring you to build it for yourself. It is in these two areas, federation and automated abstraction, that the pure-play vendors in the market, especially Composite Software and Denodo, have a significant advantage over the data integration vendors.