DataRush extends its boundaries

Pervasive has just released version 4.4 of its DataRush platform. You might think that, being a point release, this is just more of the same (whatever that same is; I’ll come to that in a moment). However, that would be an incorrect assumption: DataRush 4.4 represents a radical, and important, new direction for the product.

So, to go back to the beginning: what is DataRush? In a nutshell, it’s a very fast parallel engine for processing data. In particular, it’s a cross-core parallel engine: if you have an eight-core machine then you get eight parallel processing streams. While a few other vendors in particular markets have developed comparable capabilities, most vendors that deliver parallelised products do so across machines, so you would need eight servers to get eight-way parallelism rather than one server with eight cores. As you can imagine, that makes DataRush very much more cost effective.
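To make the cross-core idea concrete, here is a minimal sketch in Python. It is not DataRush code and says nothing about how Pervasive has implemented its engine: the function names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) and the sum-of-squares workload are my own assumptions, chosen purely for illustration. The data is split into one partition per core, each partition is processed by a separate worker process on the same machine, and the partial results are combined at the end.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Hypothetical stand-in for the work done on one partition."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(values, workers=None):
    """Process one partition per worker process, then combine the results."""
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(values, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_sum_of_squares(data))
```

On an eight-core server this would, by default, run eight worker processes side by side, which is the single-machine equivalent of the eight-server arrangement described above.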

DataRush differs from those few other suppliers that have built cross-core parallelism in that it is a general purpose engine. That is to say, you can OEM it for whatever purpose suits you. As far as Pervasive itself is concerned, to date it has focused on high performance data preparation (the company has both data profiling and matching technologies that run on top of DataRush), both for generic data cleansing purposes and to streamline preparation time for data mining and analytic functions.

So, that was the position up until version 4.2. But with 4.4, DataRush will actually perform your data mining operations for you. With this release the company has introduced an analytics function library that includes k-Means clustering; naïve Bayes, decision tree (C4.5) and k-nearest neighbour classification algorithms; four types of regression; association rule mining; and principal component analysis. This has been integrated with the Eclipse-based workflow from KNIME, the German open source data mining vendor. In addition, DataRush 4.4 supports PMML (Predictive Model Markup Language), so you can import any existing models you may have.
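For readers unfamiliar with the first algorithm in that list, the following is a small, single-threaded sketch of k-Means clustering (Lloyd's algorithm) in plain Python. It is an illustration of the technique only, under my own assumptions: the function names and the sample points are hypothetical, and it bears no relation to the DataRush library's actual interface or implementation.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # An empty cluster keeps its previous centroid.
            updated.append(old)
            continue
        updated.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return updated


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    sample = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
    centres, labels = k_means(sample, [(0.0, 0.0), (10.0, 10.0)])
    print(centres, labels)
```

The assignment step is independent for each point, which is why this kind of algorithm lends itself to being split across cores.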

The idea with DataRush is that you extract the data from your data warehouse and then process the data within the DataRush engine, making use of its inexpensive parallelism. The potential alternatives to this are a) do data mining the old fashioned way, which means extracting the data to an application server and then running the analytics there, or b) perform data mining in the database, where that is available. DataRush should be significantly faster, more accurate (since you shouldn’t need to sample the data) and less expensive than the first of these. With respect to the second, the short answer is that I don’t know how it will stack up. You still have to move the data, which is a downside, but otherwise it will likely depend on the environment. Typically, you already have a data processing workload on your warehouse or mart, so any additional in-database analytics may impact on existing workloads, which means you will have to extend your warehouse. Which of the two, DataRush or in-database analytics, will be most effective in performance and cost terms will only be proven once we have had some competitive proofs of concept. Of course, a lot of warehouse vendors do not yet have in-database analytics, or do not have very advanced ones, so in those cases DataRush should certainly be a significant contender.