Entering the data mining fray

Content Copyright © 2008 Bloor. All Rights Reserved.

For a long time SAS and SPSS have had the data mining market pretty much to themselves. True, there are other vendors, such as KXEN, InSightful and Angoss, but SAS and SPSS have clearly been the big beasts. However, that may be about to change with the introduction of RStat from Information Builders, which was announced earlier this summer and is scheduled to be generally available early in Q4.

RStat is an open source product that forms an optional add-on to the company’s WebFOCUS business intelligence environment. It takes its name from R, the open source statistical language on which it is based. R is just a language, and hitherto there has not been a user interface for it. Nevertheless, R is widely employed in universities and elsewhere (including both the US and Australian governments), with an estimated one million users worldwide. In other words, if you were going to build a data mining product, R would be a good place to start.

But the fact that RStat is built on top of R is only the beginning of the story. The second chapter is that R processes data in memory. This has obvious performance advantages, but it caused scalability problems in the days when most systems were based on 32-bit processors, since the data set had to fit within the memory that a 32-bit process could address. Now that more and more of us are running 64-bit systems, this has more or less ceased to be a problem.
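To put a rough number on the 32-bit constraint, here is a back-of-the-envelope sketch in Python. It assumes every value is stored as an 8-byte double and ignores the working copies that an analysis makes, so the real ceiling would have been lower still; the function names are purely illustrative and not part of R or RStat.

```python
# A 32-bit pointer can address at most 2**32 bytes (4 GiB).
ADDRESS_SPACE_32_BIT = 2 ** 32
BYTES_PER_DOUBLE = 8  # assumption: every cell is a double-precision number


def table_bytes(rows, columns):
    """Approximate memory needed to hold a purely numeric table in memory."""
    return rows * columns * BYTES_PER_DOUBLE


def fits_in_32_bit(rows, columns):
    """True if the raw table alone is smaller than a 32-bit address space."""
    return table_bytes(rows, columns) < ADDRESS_SPACE_32_BIT


# One million rows by 50 columns is about 400 MB, which is comfortable.
print(fits_in_32_bit(1_000_000, 50))
# Ten million rows by 100 columns is about 8 GB, which cannot be addressed.
print(fits_in_32_bit(10_000_000, 100))
```

In practice the operating system reserves part of that 4 GiB for itself, which is one more reason why the move to 64-bit matters for in-memory tools.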

Next, RStat can either generate PMML (Predictive Model Markup Language, which is the standard for data mining models) or compile models into C routines. While PMML is a standard for exchanging data mining models, it is only supported by some databases, so the fact that RStat supports both approaches means that its models should run on any platform.
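The appeal of a compiled scoring routine is that it needs nothing from the modelling environment at run time. The sketch below illustrates the idea in Python rather than C, using a logistic regression whose coefficients are invented for the example; it is not the code that RStat generates, and the field names are hypothetical.

```python
import math

# Hypothetical coefficients, as if copied out of a fitted logistic regression.
INTERCEPT = -1.5
WEIGHTS = {"age": 0.03, "balance": 0.0004}


def score(record):
    """Return the predicted probability for one record (a dict of inputs)."""
    # Linear predictor: intercept plus the weighted sum of the input fields.
    z = INTERCEPT + sum(WEIGHTS[name] * record[name] for name in WEIGHTS)
    # The logistic function maps the linear predictor to a probability.
    return 1.0 / (1.0 + math.exp(-z))


print(score({"age": 50, "balance": 2000}))
```

Because the routine is just arithmetic on the stored coefficients, it can be embedded wherever the scoring has to happen, which is the point of having an alternative for platforms that do not support PMML.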

In terms of actual approaches, Information Builders have implemented support for the ten most popular data mining methods, including both supervised and unsupervised modelling. In the former category RStat supports decision trees, boosting, random forests, support vector machines, logistic and linear regression, and neural networks; while in the latter category there is support for K-means and hierarchical clustering, and association rules. There is also support for sampling the data for both training and test data generation.
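For readers unfamiliar with the sampling step, the following is a minimal Python sketch of splitting a data set at random into training and test portions. The 70/30 split and the fixed seed are assumptions made for the example, and the function is illustrative rather than anything taken from RStat.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and split them into (training, test) lists."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(rows)       # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(10))
print(len(train), len(test))
```

The model is built on the training portion only, and the held-back test portion is then used to check how well it predicts cases it has not seen.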

Finally, and this is the real kicker, RStat provides a single development environment in which modellers can build predictive and scoring applications, regardless of whether they are data miners, statisticians, BI developers or business analysts. Moreover, RStat is integrated with Information Builders’ WebFOCUS so that you can leverage other company products such as Visual Discovery for interactive visualisation of your results.

So the question is how much impact RStat will have on the market. Well, it should be a no-brainer for existing Information Builders customers who don’t already have a data mining tool but want one. And it should be pretty compelling for customers who use SAS or SPSS products, because they can migrate their existing models (using PMML) while substantially reducing their costs of ownership. And then there are greenfield opportunities, where Information Builders will be able to offer a much broader BI/data mining environment than SPSS, and one that is comparable to SAS but less expensive. All in all it looks as if Information Builders may be on to a winner.