Vertica, along with various other companies, has made a major announcement at this week's TDWI, namely with respect to Vertica 4.0. This is a major release by any measure with the product being in beta today and general availability scheduled for the second quarter.
There are a number of major themes in this release, of which perhaps the most important is the workload management capability that is being introduced. This works in a similar manner to Teradata, in the sense that it is based on resource pools that are assigned to either users or types of tasks. The aim is to support any workload with any mix of query types and it will enable Vertica to go after enterprise data warehouse requirements where it could not previously compete.
A second major feature is support for Unicode. Historically, the product was limited to English only and this, again, was a limiting factor, which has now been removed.
Thirdly, Vertica has tackled one of the issues that the product formerly had, namely memory overflows. In order to make writing to disk more efficient, Vertica has what might be thought of as an in-memory cache (actually a Write Optimised Store) into which data is loaded prior to it being committed to disk (the Read Optimised Store). However, a problem was that the memory could become full and cause queries to queue (or get rejected) before the software had got around to writing to disk. In this release this process is automated. You can still set your own parameters but if a potential overflow is detected then the software will override those settings to initiate writing to disk. In an associated development the company has also introduced a more efficient mechanism for processing deletes and updates in the storage layer.
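To make the overflow behaviour concrete, here is a minimal sketch in Python of a write-optimised buffer that is flushed to a read-optimised store. It is a toy model and not Vertica's implementation: the class name, the row-count capacity, the `flush_threshold` parameter and the `headroom` safety margin are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnStore:
    """Toy model: an in-memory write buffer flushed to a read-optimised store."""

    wos_capacity: int      # hypothetical: maximum rows the in-memory buffer can hold
    flush_threshold: int   # hypothetical: user-configured row count that triggers a flush
    headroom: int = 1      # hypothetical: rows of slack treated as "potential overflow"
    wos: list = field(default_factory=list)  # stands in for the Write Optimised Store
    ros: list = field(default_factory=list)  # stands in for the Read Optimised Store

    def load(self, row):
        """Buffer a row, flushing if the user setting or the overflow guard fires."""
        self.wos.append(row)
        user_limit_hit = len(self.wos) >= self.flush_threshold
        overflow_risk = len(self.wos) >= self.wos_capacity - self.headroom
        if user_limit_hit or overflow_risk:
            # The overflow guard overrides the user's threshold when memory is nearly full.
            self.flush()

    def flush(self):
        """Move the buffered rows to the read-optimised store as one sorted batch."""
        self.ros.extend(sorted(self.wos))
        self.wos.clear()
```

With `ColumnStore(wos_capacity=10, flush_threshold=100)` the user's setting would never fire, yet the ninth row loaded still triggers a flush because the buffer is about to fill. That is the kind of override the paragraph above describes.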
Fourth, Vertica has extended its SQL support to add new capabilities. For example, there is now a facility for calculating moving averages, something which is important in capital markets but is difficult to do in standard SQL (though it is relatively easy in MapReduce). There are also facilities to support time series, including gap filling (when no trade occurs within a time interval) and web sessionisation. There are also more general improvements to query capabilities. For instance, Vertica could not previously perform a full outer join (you had to perform a left join and then a right join and combine the results) but now it can. Associated with all of this are improvements to the optimiser, not only to recognise all of these additional capabilities but also for its own sake. Thus the optimiser can perform re-segmentation on the fly, for example.
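For readers who want to see what these operations amount to, the following Python sketch shows a moving average, a simple form of gap filling and the old left-join-plus-right-join workaround for a full outer join. It illustrates the ideas only and says nothing about Vertica's SQL syntax. The function names are hypothetical, and carrying the last price forward is just one possible gap-filling policy, chosen here as an assumption.

```python
from collections import deque


def moving_average(values, window):
    """Yield the mean of the most recent `window` values (fewer at the start)."""
    recent = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(recent) == window:
            total -= recent[0]   # the oldest value is about to drop out of the window
        recent.append(v)
        total += v
        yield total / len(recent)


def gap_fill(ticks, start, end, step):
    """Carry the last known price forward into intervals where no trade occurred.

    `ticks` maps an integer timestamp to a price.
    """
    filled, last = {}, None
    for t in range(start, end + 1, step):
        last = ticks.get(t, last)
        filled[t] = last
    return filled


def full_outer_join(left, right):
    """Emulate a full outer join on dict keys: a left join plus the unmatched right rows."""
    rows = [(k, lv, right.get(k)) for k, lv in left.items()]
    rows += [(k, None, rv) for k, rv in right.items() if k not in left]
    return rows
```

For example, `list(moving_average([10, 20, 30, 40], 2))` gives `[10.0, 15.0, 25.0, 35.0]`, and `gap_fill({0: 100.0, 2: 101.0}, 0, 3, 1)` fills the missing intervals 1 and 3 with the preceding price.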
As I said at the outset this is a big release. It adds significant new capabilities for both vertical markets and international markets, while also extending its breadth of capabilities. Given that the company's acquisition rate is currently around 15 to 20 new customers per quarter we would expect this to further ramp up.
As usual at TDWI there are a series of announcements from major vendors. Not least of this year's releases has been the introduction of the Netezza TwinFin iClass, which follows on from last month's announcement of the Netezza Skimmer (a Skimboard is another sort of surfboard) as a low-end entry point to the Netezza range. However, the iClass (where i stands for insight) is a beast of a completely different stripe.
The iClass is an appliance for what Netezza refers to as advanced analytics, which is a part of the general drive across the sector towards in-database analytics. The big advantage of in-database analytics is, of course, that you get much, much better performance. Instead of having to extract the data to an external application server for processing, the analytics can actually be performed in situ. It also means that the analytics are more accurate because you don't need to sample the data, which is necessary in conventional environments in order to maintain performance. It's what Netezza refers to as big data meets big math.
So, what does the iClass actually provide? To begin with, it isn't just SAS scoring in the warehouse: it's much more comprehensive than that. So, there are two new APIs. One is an open language API that currently supports C/C++, Java, Python, Fortran and, most interestingly, R. It should be easy to add support for others should there be sufficient demand and there is an SDK so you can implement your own. The second API is an Open Framework API that supports MapReduce and Hadoop.
And then there are massively parallel analytics engines that parallelise analytic operations. These cover embarrassingly parallel algorithms (for processes that lend themselves to parallelism), task parallelism (for model execution) and algorithms for processes that are not embarrassingly parallel, which are parallelised as far as possible. Specifically, these engines support user defined extensions (functions, aggregates and table functions) where these are to be run within a process; analytic executables, which perform the same role but outside of a process; and nzMatrix. This last is a part of the out of the box analytics functions provided by Netezza, in this case specifically focused on linear algebra with support for the resolution of simultaneous linear equations, least squares, eigenvalues and singular value problems. If you're not a mathematician or statistician I won't bother to explain what these are but suffice it to say that they are important in certain complex analytic computations.
You might think I've finished but I haven't. On top of all that the iClass also supports an R GUI and Eclipse as well, of course, as partner-based development environments.
Which brings me on to my final point, which is that Netezza is neither getting into the analytic application business nor into the data mining business. Instead it is providing a foundation platform for its partners (like bis2 and Fuzzy Logix) to build analytic applications on. I think this makes a lot of sense: after all, there are far more potential partners with feet on the ground, providing greater coverage than Netezza could manage on its own.
The bottom line is that this is a major step forward for Netezza, differentiating itself still further from the mass of competitors that have yet to implement any sort of in-database analytics. And even those that are doing so are in many cases only providing much more elementary out of the box capabilities.
Along with new releases from lots of data warehousing vendors, TDWI has also seen the formal announcement of the DataFlux Data Management Platform, which was previously known as the Unity project. As the codename suggests, this sees the introduction of a unified data integration, data quality, master data management and data governance suite from DataFlux.
A bit of history might be in order. Traditionally, DataFlux has been a data quality (cleanse, profile and so on) vendor. It then introduced MDM. However, data integration was a part of the SAS (DataFlux's parent company) platform and, while SAS did have a development project for data federation, it never really got to grips with this. So the task involved in creating the Platform was to bring ETL and data integration into the DataFlux environment, resurrect the previous data federation development and integrate all of this together with data quality and MDM, all as a part of a single platform. Needless to say this has been a long job.
Indeed, DataFlux is to be commended on keeping to its development schedule. I first got a detailed briefing on the Platform back last September and even then the announcement date was planned for this February. In fact the product has been in beta sites since November and it should be generally available during the second quarter.
At the same time, the Platform makes a clearer distinction between SAS and DataFlux as to who sells what. Previously, infrastructure such as data integration software was marketed by SAS but this position was confusing since the sales force is primarily concerned with sales of analytic applications such as Customer Intelligence or, more broadly, the SAS platform as a BI/analytic suite. SAS sales people will still be selling data integration either when the data integration software is embedded within a solution such as Customer Intelligence or in conjunction with the SAS 9 platform as an enabling technology.
This move towards being a data quality/integration/master data stack provider is an increasingly common story in this space. However, integration between the different elements of this stack is a big issue. Some major vendors are so far away from a coherent story about integration that they do not really merit the description of a stack supplier, even if they have all of the relevant components. So the fact that DataFlux now has a genuinely integrated suite should give it a significant advantage over its competitors.
In so far as features are concerned some interesting capabilities include identity resolution, a business glossary, business process and event-driven integration, and a focus on business/IT collaboration. However, more detail will have to wait until nearer the general availability date.
Releases come thick and fast during TDWI. One of the more interesting announcements this week is Aster Data nCluster version 4.5.
Aster Data's mission is to support what it calls big data analytics. It initially addressed this need through support for MapReduce combined with SQL (SQL-MR) and by enabling analytics to be fully embedded within the MPP database. In this release the company has introduced an Eclipse-based visual development environment to make development of relevant functions and applications easier, and the IDE makes application push-down a single click function.
Analytics are actually embedded in-database by means of a hybrid environment (introduced in version 4.0) in which, in effect, the company embedded an application server within the data warehouse. In other words all MapReduce functions are executed in this space, which is co-located with the database within the Aster platform and which makes use of shared memory. So you don't get the performance and accuracy overheads associated with traditional environments.
Now, with this release, the company is focusing even further on in-database analytics. In particular, 4.5 includes the Aster Data Analytics Foundation, which provides common functions out of the box. These include time series and pattern analysis functions, (web) sessionisation, core statistics such as standard deviations and moving averages, and market basket analysis.
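As a rough indication of what two of these out of the box functions compute, here is a short Python sketch of web sessionisation and of the pair counting that underlies market basket analysis. It is plain Python rather than Aster's SQL-MR. The function names, the timeout parameter and the sample data are assumptions made for illustration and do not reflect the Analytics Foundation's actual interfaces.

```python
from collections import Counter
from itertools import combinations


def sessionise(timestamps, timeout):
    """Number the sessions in a sorted list of click times.

    A gap longer than `timeout` between consecutive clicks starts a new session.
    """
    sessions, current, previous = [], 0, None
    for t in timestamps:
        if previous is not None and t - previous > timeout:
            current += 1
        sessions.append(current)
        previous = t
    return sessions


def pair_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair a canonical order; set() ignores repeated items.
        counts.update(combinations(sorted(set(basket)), 2))
    return counts
```

With a 30 minute timeout expressed in seconds, `sessionise([0, 10, 25, 2000, 2010], 1800)` places the first three clicks in session 0 and the last two in session 1. The pair counts are the raw material from which measures such as support and confidence would then be derived.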
Finally, the third major element in this release is the introduction of a new (expanded), very granular, management console for administrators. You really need to see this to appreciate it but there is a new query and processing management view, a query timeline graph, a physical and virtual node and partition map, a node overview diagram with detailed drill-down, amongst other new features.
There is no doubt that Aster Data has done well since it entered the market, exhibiting substantial growth in 2009. However, its main claim to fame is its advanced support for MapReduce. As more and more companies add MapReduce capability to their offerings, how will this advantage stand up? There are two main aspects to this: cohabitation with SQL and the in-warehouse MapReduce application server (or equivalent). As far as we know there is no other company that offers both of these capabilities. And, while other companies support both SQL and MapReduce, most of them don't have these integrated. Then there is the visual development environment introduced with this release, not to mention the pre-built analytics.
Put all these things together and I think Aster Data remains ahead of its competition in its core market. Of course, there is the question of what that core market is. In other words, where is it that you really need MapReduce? This is more than a short article can answer but two examples would be calculations of moving averages for stock price analysis in capital markets and deep graph analysis in retail or telecommunications where you want to identify influencers. Both of these are difficult to program in SQL and while some vendors are taking the route of extending SQL with specific functions these will always be one-off solutions rather than generic ones. Thus the bottom line is that Aster Data is ahead of the market, in its chosen market, and looks likely to remain in that position for some time to come.
Last week I attended an Informatica analyst event. There was a lot to digest but here are some highlights.
To begin with, the company talked about its acquisition of Siperian. I have already commented on this but one point that emerged at the conference was the way that Informatica describes Siperian as infrastructure MDM as opposed to application MDM. This is a hitherto unrecognised distinction (with respect to terminology) in the MDM market.
Informatica distinguishes the former from the latter by saying that infrastructure MDM is domain and data model independent. By inference it castigates other providers that cannot support multi-domain MDM (from a single product) on the one hand or who base their solutions on a fixed (if customisable) as opposed to a flexible data model on the other. While I agree that supporting multi-domain MDM has significant advantages, I don't think the fact that a vendor has a single domain product necessarily represents a proactive choice that this is a better approach: just that, for one reason or another, they haven't implemented a multi-domain solution. I think everybody agrees, as a matter of principle, that multi-domain MDM is better.
I also think the point about data models is arguable: some companies may (and indeed do) prefer to work off a pre-built data model rather than something that is more flexible.
None of this detracts, of course, from the fact that Informatica is now going to provide very serious MDM competition for the likes of IBM, Oracle and SAP, all of which would fall into the application category.
Another important innovation, at least for Informatica, was the announcement of the Informatica Marketplace. This is designed to encourage the development and exchange of connectors, processes and so on across the Informatica community. Ultimately (not yet) the intention is that it will act in the sort of capacity that open source communities do for sourcing new developments and testing. Of course, this won't apply to the core products but it will take away some of the agility advantages that open source vendors have, so I think this is a pretty good move.
A third area of interest was cloud computing. Informatica sees three aspects to cloud computing: SaaS, PaaS and IaaS; that is, software, platform and infrastructure as a service. We all know what SaaS is, but PaaS means providing Informatica technology (in this case) to developers, systems integrators and so on from within the cloud, and IaaS means providing technology for the IT department (that is running operations on a day-to-day basis). I like this split. In fact, I think I would drop cloud altogether, because I think that's confusing: with one bunch of people thinking that SaaS is a part of the cloud and another thinking that IaaS hosted by, say, Amazon is what is meant by cloud. Not to mention the whole confusion over private versus public cloud. You can find more details at www.informaticacloud.com.
On the SaaS side I have to say I was impressed. There is a really easy, business level, wizard-driven interface for constructing data integration tasks. I am not surprised that the company has some 500 companies signed up for this service. The only thing I was surprised about was that the same, or a similar, interface was not in Informatica 9. I am told that this will be the case in 9.1.
And here's a quickie: there was a user panel session. One of the users has 10 major applications running across 10 databases: perhaps not ideal but understandable. And it has 30,000 Access databases! Now there's a market for the spreadsheet management vendors that also provide Access database management (which all of the big four vendors do). Of course, from an Informatica perspective, you can integrate these into the mainstream environment using data federation and data services (see next) and, once you do that, you not only provide wider access to these resources but you probably also start reducing their proliferation.
One further major topic of discussion was the data services introduced in Informatica 9 to support SOA. The interesting thing here is that it provides increased granularity to the traditional three-tier model. That is, the model that separates presentation from applications and applications from databases. Arguably, the introduction of SOA saw this transition to a four-tier model in which web services sit between the database and applications. However, the use of data services essentially represents a layer in which data is manipulated (through data integration) outside the database, thereby extending the model to five tiers. Applications are now simply variable collections of processes, while web services, which provide those processes, have a bare minimum of understanding of data (just enough to fulfil their tasks) with the emphasis shifting into the data services layer. Of course, the concept of the data-driven enterprise, a concept I wholly endorse, is broader than this, but it puts increased focus on data manipulation, which is Informatica's forte.
Of course there was heaps more: not least on complex event processing and archival (I was particularly interested in the support for application retirement), and how these will integrate with other Informatica products; not just PowerCenter but also with Identity Resolution and B2B Exchange. But I am sure you don't want me to drone on forever.
Finally, of course, there are the two big guessing games: now that Informatica has hit $500m in revenues when will it reach a billion? And, secondly, who will they buy next? Funnily enough the company was unable or unwilling to answer either question.
You often hear security officers, not to mention vendors, talk about fraud detection and prevention but you seldom (never in my experience) hear anyone talking about bribery. However, in the wake of BAE Systems' settlement with both the UK and US authorities, it is worth paying a little more attention to it. In particular, in the UK there is a bribery bill currently passing through parliament, and it is expected to be passed before the next general election: in other words in the next few months.
One of the provisions of the bill is that companies can be held accountable for the actions of their employees. In order to defend themselves against such charges companies will need to be able to prove that they have suitable provisions and processes in place to prevent bribery in the first instance and, in the second, to detect it when it does happen.
Well, that sounds a lot like fraud prevention and detection. But it also sounds a lot like Sarbanes-Oxley or other compliance requirements. Fraud is something you would like to prevent for obvious business reasons; however, there are not, typically, any regulations that require you to have anti-fraud processes in place. You might argue that PCI-DSS falls into that category but that is a special case.
Of course, while bribery is a crime in terms of offering inducements to other people it is also a crime to accept such inducements. In the UK we tend to think of bribery as being something that is only done in foreign countries but that's certainly not the case: I did some consulting for a UK-based public company a few years ago looking into its supply chain and during the course of that work the manufacturing director was suspiciously unenthusiastic about rationalising the company's suppliers and what it bought from whom. Indeed, so suspicious that the CEO and CFO started to look into it and discovered that he was taking backhanders. So there is no place for complacency.
Until the bill is passed, assuming that it is, we won't know the full extent of the regulation and what will be required of companies but it seems likely that appropriate compliance monitoring will be required, along with forensics. If this is the case then those forensics will need to be run on a regular basis. However, whatever is required, this looks like another opportunity for SIEM (security information and event management) and log management vendors.
It’s been a long time coming but Calpont has finally come to market with InfiniDB. Actually, it launched the open source Community Edition of the product last year but now it is introducing the commercial Enterprise Edition. There is, essentially, only one difference between the two versions and this is that the former runs on a single server only (as big as you like, with no constraints and all the features of the Enterprise Edition) while the latter runs across multiple servers. Calpont refers to the former as offering scale-up and the latter as scale-out.
Of course, there are some practical differences between the two editions but these only apply because of the supported architectures. Thus, there is no high availability option for the Community Edition; similarly, you can’t deploy the distributed cache from the Enterprise Edition and you can’t use parallel loading capabilities beyond the multi-threading supported by the Community Edition. But apart from the limitations of having a single server there are no differences.
The actual product itself is a column-based relational database with a MySQL front end. The secret sauce is what is known as the Extent Map. This is a metadata layer that sits over the data and which learns, retains and uses patterns that exist within the data in order to optimise I/O. It is particularly relevant where there are natural patterns within the data such as all data being time-stamped, so the product will be well suited to log management, telco call analysis, financial trading environments, web analytics and so on. The Extent Map also records information such as maximum and minimum values, number of entries and so on so that certain types of queries (for example, count queries) can be performed without requiring any I/O at all.
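The general idea of keeping summary metadata over blocks of column data can be sketched in a few lines of Python. This is an illustration of the technique as described above, not Calpont's code: the `Extent` class, the fixed extent size and the function names are all hypothetical, and a Python list stands in for data on disk.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    """A block of column values plus the summary metadata kept about it."""
    values: list
    minimum: int
    maximum: int
    count: int


def build_extents(column, size):
    """Split a column into fixed-size extents, recording min, max and count for each."""
    extents = []
    for i in range(0, len(column), size):
        chunk = column[i:i + size]
        extents.append(Extent(chunk, min(chunk), max(chunk), len(chunk)))
    return extents


def count_rows(extents):
    """Answer a row count from the metadata alone, without reading any values."""
    return sum(e.count for e in extents)


def scan_range(extents, low, high):
    """Return the matching values and the number of extents that had to be read."""
    hits, extents_read = [], 0
    for e in extents:
        if e.maximum < low or e.minimum > high:
            continue  # the metadata proves nothing here can match, so skip the read
        extents_read += 1
        hits.extend(v for v in e.values if low <= v <= high)
    return hits, extents_read
```

If the column is naturally ordered, as time-stamped data tends to be, a narrow range query touches very few extents: with `build_extents(list(range(100)), 10)`, the call `scan_range(extents, 42, 47)` reads one extent out of ten. On randomly ordered data the minimum and maximum of each extent would overlap most ranges and far less would be skipped, which is why natural patterns in the data matter.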
The real kicker is the pricing, which is $11,995 per node with discounts for 11 or more MPP nodes. According to the company this works out at between $4,000 and $7,000 per terabyte. Moreover, this is not a subscription licensing model; this is a one-time license fee, though of course you have to add maintenance and running costs. However, this way undercuts the market, even bearing in mind that some competitors can offer better compression ratios and will therefore require less disk space and therefore reduced hardware and software costs. Moreover, Calpont also offers a discount for six one-node instances (which they refer to as an Analytics 6-pack) with the intention of picking up data mart business in larger enterprises.
Calpont is late to the market and it has competitors that are already established. Nevertheless, the market is buoyant and I don’t think it is too late, particularly given the product’s performance (there are some independent benchmarks that have been run against other open source products in which the company did well—but this only applies to the Community Edition), its pricing and its positioning (in conjunction with MySQL). Put these together and InfiniDB should provide some serious competition to its more established rivals.
Informatica 9 is a major release in every sense of the word. This means that there is too much in it to go into all its details in a short article such as this, so I will concentrate on the high-level things. There are three of these: support for data services, pervasive data quality, and business/IT collaboration. However, while I will discuss these separately, for the sake of convenience you must appreciate that these are not distinct and are, in fact, complementary.
Support for data services is not a new concept. Basically, they do for the data hairball what web services do for application spaghetti. However, Informatica has gone a step (or three) beyond its rivals in the way that it has implemented this. In particular, it is based on what I would call business entities and what Informatica calls logical data objects. This is important because business entities are what business people work with (customers, orders, invoices, service history and so forth) as opposed to the tables that developers work with, and this is therefore an enabler for business/IT collaboration. Beyond that, Informatica continuously introspects these data objects in order to recognise changes. This is supported by federated capability that Informatica has written itself (it previously relied on a third party for federated services) that supports this introspection across heterogeneous sources. Also notable are the policy-based governance capabilities provided for these data services, including security, compliance, freshness and quality. So, for example, you can implement masking for sensitive data as a part of the support for data services.
Pervasive data quality is about applying data quality throughout the organisation, not just to a small coterie of people in the IT department and one or two business analysts. There are three main points. First, data quality should be used across domains and not just for names and addresses. Second, as prevention is typically better than a cure, companies should be encouraged to implement pre-emptive data quality capabilities: real-time checking as you enter data into your ERP application, for example. Third, everybody in the company should be (made) aware of how important data quality is to them and their jobs. For instance, would you make the same decisions if you knew that the information you were making those decisions on was 98% reliable as opposed to 68%? I don’t think so: you’d be a lot more cautious in the second case. As a business person you therefore actually need to see those sorts of figures associated with reports and queries upon which your decisions are made. Finally, to enable all of this, Informatica 9 provides role-based interfaces that present the user, whether developer, business analyst, data steward or end user, with just the amount of information they need to do their job most effectively. This will be minimal (and web-based) in the case of the end user but richer, in appropriate ways, for other types of user.
Business/IT collaboration is enabled both by the role-based interfaces just discussed and the use of business entities (which, incidentally, you can import from appropriate data modelling tools) as well as a number of other facilities, though business entities are not integrated with the business glossary yet (it is on Informatica’s roadmap). My own view is that the ‘specification mismatch’ which exists between user requirements and what the developer produces is one of the main reasons why so many companies continue to hand code rather than using a data integration platform: if that mismatch (which exists just as much in hand coded environments) can be overcome through use of business/IT collaboration, which I believe it can, then this will be a major ROI benefit that Informatica can use to overcome the objections of hand coding stalwarts.
If Informatica 9 can significantly broaden the market for data integration tools then one could regard it as disruptive. Further, one could make the same argument about pervasive data quality. However, I am not sure that applying the word ‘disruptive’ to a market leader makes a lot of sense: evolutionary or even revolutionary would be better. Indeed, I think the use of business entities in data integration environments really could revolutionise the way we use these tools and the productivity that can be derived from them. Whatever way you want to look at it, Informatica 9 represents a major step forward.
Cadis Software is a UK-based company (with offices in New York, San Francisco, Hong Kong and Luxembourg) that provides enterprise data management (EDM) solutions for the buy side of capital markets. That is, it provides data integration, data quality, master data management (MDM) and lightweight data warehousing for this sector. The interesting question is why you need specialised facilities in this market: why couldn’t you do what you need with IBM, for example?
There are several answers to this question. The first is that the data sources used are not just conventional back-office systems but also market data from the likes of Reuters and Bloomberg and you will not typically get connectors for this sort of data from the pure-play data integration vendors.
The second is that there are specific data quality issues on the buy side. To begin with, Cadis validates incoming data before you can apply data quality rules. This is akin to data profiling in the sense that you are assessing the quality of the data and generating exception reports. Next, you may not have enough information to tell whether two financial instruments are actually the same or, worse, if one instrument is equivalent to multiple other instruments. So you need some specific capabilities that won’t be in a standard data quality tool. More generally, Cadis uses probabilistic and fuzzy matching to automate matching processes, as well as providing manual capabilities and exception workflow. The company also provides pre-built rules for matching within financial services environments, using standards where they exist (they don’t always: there are no standards for derivatives, for example).
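To give a flavour of automated matching with an exception workflow, here is a minimal Python sketch using the standard library's `difflib`. It is not how Cadis does it. The similarity measure, the two thresholds and the function names are assumptions made for illustration, and real instrument matching would also draw on identifiers and pre-built rules rather than on names alone.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score between 0 and 1 for two instrument descriptions, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_instruments(incoming, golden, auto=0.9, review=0.6):
    """Classify each incoming name as an automatic match, a manual exception or new.

    `auto` and `review` are hypothetical thresholds: at or above `auto` the match
    is accepted automatically, between the two it is routed to a person, and below
    `review` the record is treated as a new instrument.
    """
    results = {}
    for name in incoming:
        best = max(golden, key=lambda g: similarity(name, g))
        score = similarity(name, best)
        if score >= auto:
            results[name] = ("match", best)
        elif score >= review:
            results[name] = ("review", best)
        else:
            results[name] = ("new", None)
    return results
```

The middle band is the important design choice: rather than forcing every record into match or no match, uncertain cases become exceptions for someone who understands the data to resolve.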
Third, the master data management in this market is rather different, as what you ultimately obtain is a golden copy of positions, securities and accounts/counterparties, which may in turn make use of golden copies of things like prices and assets. In other words, not only are there multiple domains but they all interact so that you can’t really consider them independently or implement them separately, as you would do in most MDM environments.
Finally, there is the data warehousing. This you could do using a third party product. What Cadis provides is primarily web-based reporting. One notable ability is that it can do transactional cubing on the fly. However, the warehouse is not intended for heavy-duty analytics. The data integration capabilities provided by Cadis can be used to load data into a third-party data warehouse where appropriate.
On top of all this the whole emphasis of the suite of products (which you license individually or en masse, as required) is that the people who understand the data should be the people who manage the data. In other words: business people not IT. This is the direction in which the leading data integration vendors are moving but Cadis is several steps ahead.
Data integration tools are a dime a dozen and there are more open source data integration tools than you can shake a stick at. In part, this is because there remains a large untapped market potential for data integration, with lots of companies still insisting that they can do it better and more effectively by hand coding (they can’t).
SnapLogic is an open source data integration vendor founded in 2006 by Gaurav Dhillon, the erstwhile co-founder and CEO of Informatica, so he knows a thing or two about both start-ups and data integration. What’s different about SnapLogic is its focus on web-based sources of data, so that it will support integration not only with SaaS providers but also rich web content and even things like Twitter and YouTube plus, of course, conventional sources like Oracle and MySQL.
However, it’s not really SnapLogic as a data integration vendor that I want to talk about. While the company will claim technical and cost advantages, and it is doing something a bit different in so far as data integration is concerned, with its emphasis on web sources of data, what is very different is SnapStore, which was launched last month and will go into beta in February next year.
The basic idea behind SnapStore is that there are far too many data sources for any one data integration vendor to provide a connector for every such source and when you start to consider combinations of sources with targets then that number multiplies rapidly. Of course, the major vendors cover the leading databases, ERP systems and so forth but there are lots of obscure and not so obscure environments that they probably don’t, even at the connector level. For example, when did you last hear a vendor talking about its Revelation database connector? Or, to take something better known: its Sage connector? Or its Zoho CRM connector?
The idea behind SnapStore is that you provide facilities for creating snaps, where a snap is anything from a simple connector to a complete dataflow that integrates (say) a SalesForce quote with a NetSuite order. Then you encourage developers to create such connectors, not just for their own purposes (which they need to do anyway) but also to share those connectors within the SnapStore. However, this isn’t just an open source junky type of sharing. Connectors are tested and certified before being placed in the SnapStore and developers are credited with 70% of the revenues accruing from any subsequent licensing of those connectors.
It is early days of course but one can see that this might really drive the development of snaps. And if it does then SnapLogic will become better and better placed as it builds up a larger and larger library of snaps. After all, why reinvent the wheel when SnapLogic can already provide it?