Avid readers will remember that I have written about EntropySoft a couple of times over the last two years. It is a specialist provider of content connectors, providing normalised access to content repositories. The connectors are optimised for particular purposes, for example Search, e-Discovery or Data Loss Prevention. The company also offers two broader products known respectively as Content Hub and Content ETL. The former provides a central point of access for all documents and is intended to simplify the integration of business applications and content repositories. The latter is used to organise, plan and do document transfers.
Since I last wrote about the company it has extended the number of its connectors to 40, most notably adding cloud-based connectors for Iron Mountain, Google Apps and SalesForce.com. There will be more such in due course. Other new connectors are Symantec Enterprise Vault, Exchange 2010 and SharePoint 2010.
The company has also extended its Content Hub and Content ETL capabilities. Most notably, the software can now track repository changes (document creation, deletion and modification) as they happen so that you can manage your document processes in real-time. In addition, the company will shortly be introducing synchronisation capabilities within its Content ETL product. This will allow you to keep documents synchronised across either heterogeneous (for example, SharePoint to Documentum) or homogeneous (SharePoint to SharePoint) repositories. And this will apply both in-house and, say, between suppliers and clients. Further, it will also allow document sharing between in-house and cloud-based environments.
This is neat. Moreover (and this is the main reason why I am writing this article), the company has just announced the introduction of an EntropySoft Appliance. This includes the company’s entire product stack and it is available either as a physical or virtual appliance. This appliance is aimed specifically at providing high performing pre-packaged services for functions such as synchronisation or digital archiving. Not only is repository synchronisation a major issue for many companies but, as far as I am aware, this is the first such product to be made available to resolve this issue, so it represents an important step forward. The appliance itself is available via a monthly subscription based on a pay-as-you-go model so it offers content integration for companies of all sizes.
Leaving aside its products for a moment, it has been interesting to watch the evolution of EntropySoft, in terms of its marketing. It initially went to market to get business wherever it could – a reasonable enough strategy – regardless of whether this was with end users or partners. It then turned its attention particularly to partner channels and as a result it has a very significant partner community, with its products widely embedded within other vendors’ products. Symantec, EMC Kazeon, Endeca and IDS Scheer are all notable partners in this category. The second set of partners that the company works with is what it calls service partners, which do all relevant migrations and provide professional services for EntropySoft products.
The EntropySoft Appliance solves a real business need for which there is pent-up demand. Providers of cloud-based services will certainly be interested in distributing the appliance to their customers. Indeed, I expect there to be a significant OEM market for the appliance. That’s fine: EntropySoft is well-placed to capitalise on this demand. On the other hand, I also expect that many organisations will ask for synchronisation capabilities internally (in some cases with very specific requirements). This does pose a potential problem. I suspect that EntropySoft may need more service partners or will need to increase its own in-house direct sales capabilities with the release of its appliance. The company has said that it is opening a US operation and that it is actively recruiting partners to distribute the appliance. EntropySoft will need to ensure that it has sufficient resources, either internally or via service partners, to meet the demand that is likely to arise.
I will leave it to others to comment on the overall impact that Mark Hurd has had on HP since he became CEO, and the only general remark I will make about his departure is that a drop in the share price of something approaching 10% seems pretty much overdone. What I will discuss is whether this has any impact on NeoView, the company’s data warehousing offering.
There are two things we can say about NeoView. The first is that it has very much been Mark Hurd’s baby. As the former head of Teradata he saw an opportunity to out-compete that company in the high end data warehousing market. It is generally felt that NeoView could not be allowed to fail, because of its close association with Mark Hurd.
The second thing we can say is that NeoView has not been as successful as HP would have liked. This is probably an understatement: when I talk to other vendors HP is never considered to be a serious competitor and it is rarely mentioned by end users either. My guess is that the NeoView group is loss making and, moreover, that these are significant losses: if it was a start-up it could be having difficulty in raising further funding.
The question is whether the new CEO will see NeoView as something that he or she will want to continue to invest in. Will it be viewed as strategic and, if not, can profitability be seen in the future? There has to be a significant possibility that the answer to all of these questions is no. If that is the case, and lacking the emotional and reputational association with NeoView that Mark Hurd has had, then there must be a serious risk that the product will be canned. Of course, this won’t happen soon: a new CEO has to be appointed and even then there will be other priorities – NeoView is a pretty small part of HP – but he or she will get around to it sooner or later.
I have been accused on more than one occasion of being a technophobe. Which is kind of weird, given my profession. However, it is true that my mobile phone is usually switched off (for historic reasons—I used not to be able to get a reliable signal where I live—it is still only 2G—and I never got into the habit of turning it on). Similarly, I don’t use Facebook, I don’t text and I rarely Tweet (only when I have something to say, and even then I sometimes forget). So, it comes as a bit of a surprise that I am now on YouTube.
This is all the fault of Kalido: they took me out for a nice dinner while I was in Boston recently, plied me with copious quantities of wine and then interviewed me about my views on data governance. I guess I could write down those thoughts here but that would be just replication of what’s in the video. While I knew I was being filmed (the quantities were not that copious!) I suppose I thought of it as a sort of home video rather than anything that was going to be made generally available. Anyway, you can catch my thoughts at http://www.youtube.com/view_play_list?p=53B96D702917CA64.
I wrote about RainStor back in December, just after they changed their name from Clearpace Software and moved their headquarters to the United States (though development continues primarily in the UK).
As a reminder, RainStor provides a highly compressed (typically 40 times but can be as much as 100 times) file system. A couple of notable features are worth mentioning. The first is that if you are using RainStor for relational data (typically, for application retirement or archival—RainStor is used within Informatica’s Data Archive [previously Applimation] product) then RainStor ingests the schema as well as the data. It then supports schema evolution, so that you can make queries at a point in time (that is, you can look at the data exactly as it would have appeared at a particular point in time). Secondly, it includes a query engine that supports (translates) incoming SQL so that you can run conventional business intelligence environments against RainStor.
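To make the point-in-time idea concrete, here is a minimal sketch of how an archive might answer "what did the data look like on date X?" when both the rows and the schema have changed over time. Everything in it (the table contents, the column names and the functions `schema_as_of` and `query_as_of`) is hypothetical and invented for illustration; it is not RainStor's API or storage format.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical archive: each schema version records the date it took effect
# and the columns that existed from that date onwards.
SCHEMA_VERSIONS = [
    (date(2008, 1, 1), ["id", "name"]),
    (date(2009, 6, 1), ["id", "name", "region"]),  # column added later
]

# Each row version carries the period during which it was the current one.
ROW_VERSIONS = [
    {"id": 1, "name": "Acme", "region": None,
     "valid_from": date(2008, 3, 1), "valid_to": date(2009, 9, 1)},
    {"id": 1, "name": "Acme Ltd", "region": "EMEA",
     "valid_from": date(2009, 9, 1), "valid_to": date.max},
    {"id": 2, "name": "Globex", "region": "US",
     "valid_from": date(2009, 7, 1), "valid_to": date.max},
]


def schema_as_of(when):
    """Return the column list that was in force on the given date."""
    effective_dates = [effective for effective, _ in SCHEMA_VERSIONS]
    index = bisect_right(effective_dates, when) - 1
    if index < 0:
        raise ValueError("no schema existed at that date")
    return SCHEMA_VERSIONS[index][1]


def query_as_of(when):
    """Return the rows, shaped by the schema, exactly as they stood on that date."""
    columns = schema_as_of(when)
    return [
        {column: row[column] for column in columns}
        for row in ROW_VERSIONS
        if row["valid_from"] <= when < row["valid_to"]
    ]
```

Calling `query_as_of(date(2008, 12, 31))` returns the single Acme row with only the `id` and `name` columns, because the `region` column did not exist at that point.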
Ok, so that was the position. Now the company has released RainStor 4 and announced the completion of a B round of funding from investors that include Informatica, as well as new partners such as EMC (Atmos private cloud).
RainStor 4 provides new platform support, improved performance for both ingestion and queries (50% improvement in both cases) and, perhaps most significantly, a number of new compliance features. These last include legal hold, record level expiry and auto-delete capabilities as well as the ability to add comments and to audit these.
Perhaps most interesting, however, are the vertical sectors that RainStor is intending to address in conjunction with partners. The first of these is in telcos, where the company already has a partnership with Group 2000 to address the EU data retention directive but the company also sees scope in the United States and other areas where lawful intercepts are more of an issue, as well as for non-CDR data for monitoring SOX compliance for example.
The other two sectors that RainStor has identified are in healthcare and in financial services, where there are major compliance considerations in both cases. Another potentially fruitful area is for log management and the SIEM (security information and event management) markets where RainStor is significantly more efficient than the home grown file stores (which often don’t support SQL either) that many of the vendors have developed. Yet another possibility is to use RainStor for near-line storage in data warehousing environments, where it will be a lot less costly to store rarely used, historic data in RainStor as opposed to the warehouse itself.
All of these areas (with the exception of telcos) are works in progress and we will have to wait to see what transpires. However, what does seem clear is that RainStor is emerging as the clear leader in its space. While there aren’t many companies that specialise in this sort of technology it seems apparent that RainStor is definitely moving ahead of its rivals.
Pervasive has just released version 4.4 of its DataRush platform. Which you might think, being a point release, is just more of the same (whatever that same is—I’ll come to that in a moment). However, that would be an incorrect assumption: DataRush 4.4 represents a radical, and important, new direction for DataRush.
So, to go back to the beginning: what is DataRush? In a nutshell it’s a very fast parallel engine for doing stuff. In particular, it’s a cross-core parallel engine. What that means is that if you have an eight core machine then you get eight parallel processing streams. While there are a few other vendors in particular markets that have developed comparable capabilities most vendors that deliver parallelised products do so across machines: so you would need eight servers to get eight-way parallelism, for example, rather than one server with eight cores. As you can imagine, that makes DataRush very much more cost effective.
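For readers who want to see what cross-core parallelism looks like in practice, here is a minimal sketch using nothing but the Python standard library: the work is split into one chunk per core and each chunk is processed by a separate worker process on the same machine. This is only an illustration of the general idea, not of how DataRush is implemented, and the function names are my own.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The per-core unit of work; any CPU-bound function could stand in here."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=None):
    """Run sum_of_squares over one chunk per core and combine the results."""
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(list(items), workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))
```

On an eight core machine `parallel_sum_of_squares(range(1_000_000))` would run eight worker processes side by side. When calling it from a script, do so under an `if __name__ == "__main__":` guard, which the process pool requires on some platforms.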
DataRush differs from those few other suppliers that have built cross-core parallelism in that it is a general purpose engine. That is to say, you can OEM it for whatever purpose suits you. In so far as Pervasive itself has been concerned, to date it has focused on high performance data preparation (the company has both data profiling and matching technologies that run on top of DataRush) both for generic data cleansing purposes and to streamline preparation time for data mining and analytic functions.
So, that was the position up until version 4.2. But with 4.4, DataRush will actually perform your data mining operations for you. With this release the company has introduced an analytics function library that includes k-Means clustering; naïve Bayes, decision tree (C4.5) and k-nearest neighbour classification algorithms; four types of regression; association rule mining; and principal component analysis. This has been integrated with the Eclipse-based workflow from KNIME, the open source data mining vendor (which is German). In addition, DataRush 4.4 also supports PMML (predictive modelling mark-up language) so you can import any existing models you may have.
The idea with DataRush is that you extract the data from your data warehouse and then process the data within the DataRush engine, making use of its inexpensive parallelism. The potential alternatives to this are a) do data mining the old fashioned way, which means extracting the data to an application server and then running the analytics there or b) perform data mining in the database where that is available. DataRush should be significantly faster, more accurate (since you shouldn’t need to sample the data) and less expensive than the first of these. With respect to the second, the short answer is that I don’t know how it will stack up: you still have to move the data, which is a downside, but otherwise it will likely depend on the environment. Typically, you already have a data processing workload on your warehouse or mart, so any additional in-database analytics may impact on existing workloads, in which case you will have to extend your warehouse. Which will be most effective in performance and cost terms (using DataRush or in-database analytics) will only be proven once we have had some competitive proofs of concept. Of course, a lot of warehouse vendors do not yet have, or do not have very advanced, in-database analytics so in those cases DataRush should certainly represent a significant contender.
It is easy to say what Kapow (the product) is: it's a web data server that is available either as a product or via software as a service (SaaS). However, that doesn't tell you what it does or, more especially, what it can be used for. This is because web data services have such broad applicability within the enterprise. For example, Kapow can be used to power BI applications based on social media data for improved predictive analytics, or to enable and extend the value of mashups with real-time web data, or to automate content migration from one content management store to another. That's quite a range!
Regardless of the business use, the point about Kapow is that you can, without any coding, access any web-enabled source, extract content from it, combine that content with any other similarly captured content and then use the results in more or less any way that you wish. The database module, for example, extends this capture capability to information stored in leading SQL databases or generated by leading search or business intelligence products.
You can use Kapow as an ETL (extract, transform and load) tool for any web-based content, with the product's Design Studio as a visual IDE for defining transforms (which don't just have to be for content and could be for directly building applications); you can use portal clipping, along with other capabilities, to build mash-ups; you can use Kapow's native capabilities to support the collection of data for analytic purposes, such as sentiment analysis derived from Twitter feeds and, with the content migration module, you can automate all the migration of content into your new CMS (content management system).
So we can't adequately define Kapow by what it does because it can be used in lots of different ways. On the other hand, saying that it is a Web Data Server hardly conjures up the range of environments in which Kapow can be used. Which means that we need to get down to fundamentals: what exactly is it that Kapow does under the covers?
In practice, Kapow is about collecting web-enabled or database-held information, manipulating it and then passing the aggregated and transformed data to an application that wants to process that data in some way. Now, you could say that that's what a data integration tool does and, indeed, Kapow certainly has functionality that overlaps with products in that category. On the other hand you wouldn't use Kapow for loading data from an operational database into a data warehouse (though you might use it for loading external web-based data to augment internal data).
There's one other thing (actually there are several) that I haven't mentioned and that is that you can use Kapow to wrap a web application and expose it as a service. Now, I think, we are getting to the heart of the matter: Kapow captures and presents information on demand (though I hesitate to use that term, since it is almost proprietary). In other words, Kapow is a Web data services product, giving you agile access to anything you can see in a browser. What's more I like it. I like it a lot.
There are scads of NoSQL databases such as MongoDB and Apache Cassandra. According to Wikipedia there are 44 of them—though that isn’t a very accurate listing because InterSystems Caché, for example, does support SQL. Non-relational would be a more accurate distinction but I think it’s a bit silly to include object-oriented, multi-valued and key-valued databases, amongst others, in a single bucket. Anyway, what they all aim to do is to provide better performance than comparable relational databases. In most instances however (InterSystems is an exception), the key word is ‘comparable’: MongoDB, for example, is document oriented and, therefore, not really comparable.
VoltDB, whose availability has just been formally announced, is not a NoSQL database. However, it is aimed at out-performing traditional relational databases. Moreover, there is no ‘comparable’ qualification: VoltDB aims to, and does, out-perform relational databases in their heartland of OLTP. For example, the company has run benchmarks comparing VoltDB with the latest release of a well-known database. Using what it describes as a “TPC-like” test VoltDB recorded 53,000 transactions per second on a single node system while the market database product could only manage 1,155 tps. With a 12 node system it has been able to demonstrate 0.9x scalability. One of its existing customers (there were more than 150 beta sites), which is an online gaming site, again with 12 nodes, has recorded 1.3m tps. By anyone’s standards that’s fast.
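To put those figures in perspective, the arithmetic can be sketched as follows. The raw numbers are the ones quoted above. The reading of "0.9x scalability" as 90% of perfectly linear scaling is my own assumption rather than anything VoltDB has stated, and the function names are purely illustrative.

```python
SINGLE_NODE_TPS = 53_000       # VoltDB, single node, "TPC-like" test
INCUMBENT_TPS = 1_155          # the well-known database on the same test
CUSTOMER_TPS = 1_300_000       # online gaming customer
CUSTOMER_NODES = 12


def speedup(new_tps, old_tps):
    """How many times more transactions per second the new system handles."""
    return new_tps / old_tps


def projected_cluster_tps(single_node_tps, nodes, efficiency):
    """Cluster throughput ASSUMING 'efficiency' means a fraction of linear scaling."""
    return single_node_tps * nodes * efficiency


print(round(speedup(SINGLE_NODE_TPS, INCUMBENT_TPS), 1))         # 45.9
print(round(CUSTOMER_TPS / CUSTOMER_NODES))                      # 108333 tps per node
print(round(projected_cluster_tps(SINGLE_NODE_TPS, 12, 0.9)))    # 572400
```

Note that the customer's per-node figure comes from a different workload from the benchmark, so the two are not directly comparable.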
The key things that you get with VoltDB, apart from high performance, are that it is a relational database supporting SQL, and that it supports full ACID capabilities for transactional consistency. You don’t get those things with NoSQL databases.
How does VoltDB do it? Simply put, it removes all of the baggage that has accreted in relational databases over the last 30 years. In particular, it does not have latching, locking or logging, and it doesn’t need buffer management because it uses modern in-memory processing techniques. According to VoltDB these four things alone take up 93% of processing time in traditional OLTP databases before you even start to do anything else.
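As a rough back-of-the-envelope check on what that 93% figure implies, here is a small sketch. It assumes, simplistically, that the overhead could be removed entirely and that the remaining useful work takes exactly as long as before. That is my simplification, not a claim made by VoltDB.

```python
def max_speedup(overhead_fraction):
    """Upper bound on speedup if the overhead vanishes and the rest is unchanged."""
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in the range [0, 1)")
    return 1.0 / (1.0 - overhead_fraction)


print(round(max_speedup(0.93), 1))  # 14.3
```

In other words, if only 7% of the time goes on useful work then stripping out everything else gives you, at best, something like a 14-fold improvement from that source alone.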
VoltDB hasn’t been a well-kept secret and I and others have written about it in the past so I won’t belabour the technical issues. What I don’t think the market knew, however, is that VoltDB is available via an open source GPL license with community support, or there is a commercial license available for those that do want formal support, professional services and so on.
The question, of course, is how much impact VoltDB will make on the general purpose market. I suspect it will be considerable. Mike Stonebraker has a history of bringing successful products to market (for example, Ingres and, more recently, both Streambase and Vertica). On the other hand I am not sure that I see many companies porting mission critical applications to VoltDB from Oracle or DB2, at least in the short term. However, I do see them moving away from MySQL (even with Memcached). It is also bad news for the (true) NoSQL databases: there aren’t that many environments that don’t need transactional consistency (Cassandra’s implementation at Facebook is one example that doesn’t) and, aside from anti-SQL bigots, almost everybody wants SQL support. I think the future for VoltDB looks bright.
Back at the beginning of March, Bloor Research published a paper describing our Spreadsheet Management Maturity Model (you can download it from www.bloorresearch.com/research/market-update/1094/Spreadsheet-management-maturity-model.html) but I didn’t, for one reason or another, get around to writing a blog about it.
There are a couple of interesting things about this model. The first is that none of the vendors in this space nor, as far as I am aware, any of the auditing or consulting firms, have developed their own models. Which is, I guess, why I thought it would be a good idea. The second is that it is actually a dual model.
What I mean by this is that it is not just a question of evaluating the maturity of your organisation when it comes to spreadsheets (or any end-user computing resources—such as Access databases—for that matter) but that that in itself is often a result of the maturity of the users of spreadsheets within your enterprise.
Basically, what happens is that organisations start by using spreadsheets with no real expertise at all: just some self-taught capabilities. Then some bright spark in a relatively lowly position works out that if he (or she) gets really good at using spreadsheets then that will give him a political and competitive advantage over his colleagues (or, let’s be fair, he may simply be interested in doing his job better). In any case, he becomes the in-house guru on spreadsheets, he writes custom macros to implement any controls that may be put around the spreadsheets (which, of course, can’t be maintained once he leaves the company), and he becomes the fount of all wisdom on all things Excel.
Needless to say this strategy works and our hero (or heroine) duly gets promoted to a more senior position and, based on his own experience with spreadsheets, starts to drive a more formalised approach throughout the company. This leads, ultimately, to an environment where users actually get proper training on best practices for the development and use of spreadsheets, to ensure that they are compliant with both internal governance policies and external regulations such as Sarbanes-Oxley.
Ultimately, of course, spreadsheet management should come under the umbrella of data governance as it covers exactly the same data quality issues as more formalised environments and, indeed, relational databases and spreadsheets often act as data sources to one another so it is logical that they should come under the same control.
Anyway, that’s a starter on our Spreadsheet Management Maturity Model. I will be presenting a webinar on the subject, going into more detail on the organisational aspects of the model, on May 25th (10.00am EST, 3.00pm BST, 4.00pm CET). This is being run in conjunction with CIMCON and you can register for the webinar at www.sarbox-solutions.com/webinar/live_webinars/sox-xl_webinar.asp.
Vertica, along with various other companies, has made a major announcement at this week's TDWI, namely with respect to Vertica 4.0. This is a major release by any measure with the product being in beta today and general availability scheduled for the second quarter.
There are a number of major themes in this release, of which perhaps the most important is the workload management capability that is being introduced. This works in a similar manner to Teradata, in the sense that it is based on resource pools that are assigned to either users or types of tasks. The aim is to support any workload with any mix of query types and it will enable Vertica to go after enterprise data warehouse requirements where it could not previously compete.
A second major feature is support for Unicode. Historically, the product was limited to English only and this, again, was a limiting factor, which has now been removed.
Thirdly, Vertica has tackled one of the issues that the product formerly had, namely memory overflows. In order to make writing to disk more efficient, Vertica has what might be thought of as in-memory cache (actually a Write Optimised Store) into which data is loaded prior to it being committed to disk (the Read Optimised Store). However, a problem was that the memory could become full and cause queries to queue (or get rejected) before the software had got around to writing to disk. In this release this process is automated. You can still set your own parameters but if a potential overflow is detected then the software will override those settings to initiate writing to disk. In an associated development the company has also introduced a more efficient mechanism for processing deletes and updates in the storage layer.
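The behaviour described here can be sketched as a toy model: loads go into an in-memory buffer, the user sets how often it is written out, and the software overrides that setting when the next load would push memory past a high-water mark. The class, its parameters and the 90% threshold are all hypothetical and chosen for illustration; this is not Vertica's implementation.

```python
class WriteBuffer:
    """Toy model of an in-memory write store that spills to a disk store."""

    def __init__(self, capacity, flush_every, high_water=0.9):
        self.capacity = capacity        # maximum rows held in memory
        self.flush_every = flush_every  # user setting: flush after this many loads
        self.high_water = high_water    # fraction of capacity that forces a flush
        self.memory = []                # stands in for the write optimised store
        self.disk = []                  # stands in for the read optimised store
        self.loads_since_flush = 0
        self.forced_flushes = 0

    def flush(self):
        """Move everything held in memory to disk."""
        self.disk.extend(self.memory)
        self.memory.clear()
        self.loads_since_flush = 0

    def load(self, rows):
        """Accept a batch of rows (assumed small enough to fit in memory alone)."""
        # Override the user's schedule if this batch would cross the high-water mark.
        if len(self.memory) + len(rows) > self.capacity * self.high_water:
            self.flush()
            self.forced_flushes += 1
        self.memory.extend(rows)
        self.loads_since_flush += 1
        # Otherwise honour the user's own setting.
        if self.loads_since_flush >= self.flush_every:
            self.flush()
```

With `WriteBuffer(capacity=10, flush_every=100)`, two loads of five rows each trigger one forced write to disk, even though the user's own setting would not have flushed until the hundredth load.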
Fourth, Vertica has extended its SQL support to add new capabilities. For example, there is now a facility for calculating moving averages, something which is important in capital markets but is difficult to do in standard SQL (though it is relatively easy in MapReduce). There are also facilities to support time series, including gap filling (when no trade occurs within a time interval) and web sessionisation. There are also more general improvements to query capabilities. For instance, Vertica could not previously perform a full outer join (you had to perform a left join and then a right join and combine the results) but now it can. Associated with all of this are improvements to the optimiser, not only to recognise all of these additional capabilities but also for its own sake. Thus the optimiser can perform re-segmentation on the fly, for example.
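For anyone unfamiliar with these operations, here is a minimal sketch of three of them in plain Python: a trailing moving average, gap filling by carrying the last price forward, and a full outer join built from a left join plus the unmatched rows from the right. The function names and data shapes are my own and have nothing to do with Vertica's SQL syntax. The join assumes each key appears at most once on each side.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average; early entries average over what is available."""
    recent, total, averages = deque(), 0.0, []
    for value in values:
        recent.append(value)
        total += value
        if len(recent) > window:
            total -= recent.popleft()
        averages.append(total / len(recent))
    return averages


def fill_gaps(ticks, start, end):
    """Carry the last seen price forward into intervals where no trade occurred."""
    filled, last = [], None
    for interval in range(start, end + 1):
        last = ticks.get(interval, last)
        filled.append((interval, last))
    return filled


def full_outer_join(left, right):
    """Emulate a full outer join on key: a left join plus the right-only rows."""
    joined = [(key, value, right.get(key)) for key, value in left.items()]
    joined += [(key, None, value) for key, value in right.items() if key not in left]
    return joined


print(moving_average([1, 2, 3, 4, 5], 3))        # [1.0, 1.5, 2.0, 3.0, 4.0]
print(fill_gaps({1: 10.0, 3: 10.5}, 1, 4))       # [(1, 10.0), (2, 10.0), (3, 10.5), (4, 10.5)]
print(full_outer_join({"a": 1, "b": 2}, {"b": 20, "c": 30}))
# [('a', 1, None), ('b', 2, 20), ('c', None, 30)]
```

The join function shows why the old workaround was a nuisance: you have to take care not to emit the matched rows twice when combining the two halves.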
As I said at the outset this is a big release. It adds significant new capabilities for both vertical markets and international markets, while also extending its breadth of capabilities. Given that the company's acquisition rate is currently around 15 to 20 new customers per quarter we would expect this to further ramp up.
As usual at TDWI there are a series of announcements from major vendors. Not least of this year's releases has been the introduction of the Netezza TwinFin iClass, which follows on from last month's announcement of the Netezza Skimmer (a Skimboard is another sort of surfboard) as a low-end entry point to the Netezza range. However, the iClass (where i stands for insight) is a beast of a completely different stripe.
The iClass is an appliance for what Netezza refers to as advanced analytics, which is a part of the general drive across the sector towards in-database analytics. The big advantage of in-database analytics is, of course, that you get much, much better performance. Instead of having to extract the data to an external application server for processing, the analytics can actually be performed in situ. It also means that the analytics are more accurate because you don't need to sample the data, which is necessary in conventional environments in order to maintain performance. It's what Netezza refers to as big data meets big math.
So, what does the iClass actually provide? To begin with, it isn't just SAS scoring in the warehouse: it's much more comprehensive than that. So, there are two new APIs. One is an open language API that currently supports C/C++, Java, Python, Fortran and, most interestingly, R. It should be easy to add support for others should there be sufficient demand and there is an SDK so you can implement your own. The second API is an Open Framework API that supports MapReduce and Hadoop.
And then there are massively parallel analytics engines that parallelise analytic operations including embarrassingly parallel algorithms for processes that lend themselves to parallelism, task parallelism (for model execution) and algorithms for not embarrassingly parallel processes that parallelise these as much as possible. Specifically, these engines support user defined extensions (functions, aggregates and table functions) where these are to be run within a process; analytic executables, which perform the same role but outside of a process; and nzMatrix. This last is a part of the out of the box analytics functions provided by Netezza, in this case specifically focused on linear algebra with support for the resolution of simultaneous linear equations, least squares, eigenvalues and singular value problems. If you're not a mathematician or statistician I won't bother to explain what these are but suffice it to say that they are important in certain complex analytic computations.
You might think I've finished but I haven't. On top of all that the iClass also supports an R GUI and Eclipse as well, of course, as partner-based development environments.
Which brings me on to my final point, which is that Netezza is neither getting into the analytic application business nor into the data mining business. Instead it is providing a foundation platform for its partners (like bis2 and Fuzzy Logix) to build analytic applications on. I think this makes a lot of sense: after all, there are far more potential partners with feet on the ground, providing greater coverage than Netezza could manage on its own.
The bottom line is that this is a major step forward for Netezza, differentiating itself still further from the mass of competitors that have yet to implement any sort of in-database analytics. And even those that are doing so are in many cases only providing much more elementary out of the box capabilities.