Federating Big Data

Data federation and virtualisation vendors such as Composite Software, Denodo and, to a lesser extent, Informatica have had it all their own way for some time, but now there’s a new kid on the block. Cirro launched this week with a federated solution designed specifically to cater for big data environments as well as traditional warehouses and marts.

Cirro has been built from the ground up to support big data, rather than having that support appended after the fact. For example, it not only supports HiveQL but can also generate MapReduce jobs directly, so that you can bypass the relatively limited functionality of Hive. Today the company supports Hadoop, and it will soon support other sources such as Splunk, MongoDB and Cassandra. I understand that the company has already had requests to support SPARQL (for graph databases) and this is also likely to be available in due course. All of which means that Cirro is likely to have much broader support for big data than other vendors in the market.

However, this is not the biggest difference between Cirro and other federation vendors. With other suppliers you start by creating a semantic layer that third party products, such as business intelligence tools, then access. You can do this by using a data modelling tool or by defining virtual views, for example, but in any case there is a process to go through.

Cirro has taken a different approach. Its argument is that BI tools have their own semantic layer built in, so why go through the effort and cost of defining a further one? Instead, Cirro provides a library of extensible Excel functions that you can use to create relevant “views” using the Cirro Excel plug-in. It is rather like a 4GL for federation: you don’t have to understand Hive or MapReduce or even SQL, because all of that is under the covers. If you can use Excel functions, you can use Cirro. For example, in the demonstration I saw, it took just a couple of minutes to create a view across Twitter feeds stored in Hadoop and combine that with data from a MySQL database. This view can be used directly by the business analyst and then thrown away if it was a one-off, or it can be published to the Cirro Data Hub. If you then want to look at this data using a business intelligence tool, Cirro provides that information to the tool using its own understanding of that tool’s semantics.
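To make the idea of a view across Twitter data and a relational database more concrete, here is a minimal Python sketch of the underlying join. It does not use Cirro's Excel functions or any Cirro API, none of which are described here. The sample tweets, the `customers` table, its columns and the function names are all hypothetical, and an in-memory SQLite database stands in for MySQL.

```python
import sqlite3
from collections import Counter

# Hypothetical sample records standing in for tweets stored in Hadoop.
TWEETS = [
    {"user": "alice", "text": "Loving the new release"},
    {"user": "bob", "text": "Support was slow today"},
    {"user": "alice", "text": "Great webinar"},
]


def make_customer_db():
    """Build an in-memory SQLite database standing in for the MySQL source."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (twitter_handle TEXT, segment TEXT)")
    db.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [("alice", "enterprise"), ("bob", "smb")],
    )
    return db


def tweets_per_segment(tweets, db):
    """Federated 'view': count tweets per customer segment across both sources."""
    # Aggregate on the tweet side first so that less data crosses sources.
    per_user = Counter(t["user"] for t in tweets)
    # Look up each handle's segment in the relational source.
    segments = dict(db.execute("SELECT twitter_handle, segment FROM customers"))
    view = Counter()
    for user, count in per_user.items():
        if user in segments:
            view[segments[user]] += count
    return dict(view)


if __name__ == "__main__":
    print(tweets_per_segment(TWEETS, make_customer_db()))
```

The point of the sketch is only that the analyst asks one question and the federation layer decides what to fetch from each source. In Cirro's case that request would be expressed through Excel functions rather than code.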

Of course, the other big requirement for data federation is query performance, and Cirro has spent a lot of time on its optimiser. It is cost-based, with knowledge of the resources available to it, and provides smart caching for both final and intermediate results. It can also normalise costs across data sources so that, when comparing query plans, it is comparing apples with apples. Finally, the product includes dynamic re-optimisation: as sub-queries complete, the optimiser re-runs to see whether the original plan is still the best or whether another plan would be better going forward. The product also comes with the option of what the company calls the Cirro Multi Store, which can be used as a staging database for both structured and unstructured data, as a data mart, or directly during query processing when that is appropriate.
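As a rough illustration of cost normalisation and dynamic re-optimisation, the following Python sketch converts each source's native cost into seconds and re-picks the cheapest plan each time a sub-query reports its actual row count. It is not Cirro's optimiser: the plan names, cost formulas, conversion factors and row counts are all assumptions chosen to show the mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed factors converting each source's native cost unit into seconds,
# so that plans running on different sources can be compared like for like.
SECONDS_PER_UNIT: Dict[str, float] = {"hadoop": 1.0, "mysql": 0.001}


@dataclass(frozen=True)
class Plan:
    name: str
    source: str  # where the remaining work would run
    native_cost: Callable[[Dict[str, int]], float]  # cost in that source's units


def normalised_cost(plan: Plan, rows: Dict[str, int]) -> float:
    """Cost of a plan in seconds, given current row counts."""
    return plan.native_cost(rows) * SECONDS_PER_UNIT[plan.source]


def best_plan(plans: List[Plan], rows: Dict[str, int]) -> Plan:
    return min(plans, key=lambda p: normalised_cost(p, rows))


def run_with_reoptimisation(
    plans: List[Plan],
    estimates: Dict[str, int],
    subqueries: List[Tuple[str, Callable[[], int]]],
) -> List[str]:
    """Return the plan chosen initially and after each sub-query completes."""
    rows = dict(estimates)
    history = [best_plan(plans, rows).name]
    for name, run in subqueries:
        rows[name] = run()  # replace the estimate with the observed row count
        history.append(best_plan(plans, rows).name)
    return history


# Hypothetical plans for joining tweets (Hadoop) with customers (MySQL).
PLANS = [
    Plan("join_in_mysql", "mysql", lambda r: 1000 + r["tweets"]),
    Plan("join_in_hadoop", "hadoop", lambda r: 30 + r["customers"] * 0.001),
]
ESTIMATES = {"tweets": 1_000, "customers": 10_000}
# The tweet sub-query returns far more rows than estimated.
SUBQUERIES = [("tweets", lambda: 500_000), ("customers", lambda: 10_000)]

if __name__ == "__main__":
    print(run_with_reoptimisation(PLANS, ESTIMATES, SUBQUERIES))
```

With these made-up numbers the optimiser starts with the MySQL-side join, then switches to the Hadoop-side join once the tweet sub-query turns out to return many more rows than estimated. A real optimiser would also account for work already done and for cached intermediate results, which this sketch ignores.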

Cirro is largely staffed by ex-DATAllegro and Microsoft people and, as a result, it has good contacts throughout the warehousing and BI spaces. So it is not surprising that the company is already getting some traction both with beta customers and partners. It looks likely to create a serious challenge to the existing players in this market.