I am only an occasional writer these days, as I find there is very little of note amongst the myriad of BI offerings, and the market leaders amaze me more for what they charge than for what they do. One of the products I follow with great interest, however, is Pentaho. My interest in Pentaho stems back to a time when I was seeking a cheaper alternative to the market-leading choices and was shown Pentaho, which struck me as a competent tool at a fair price. Since then it has evolved into one of the handful of products I follow closely, because it has emerged as a leader in supporting the analysis of Big Data without the complexity and mystique that others have failed to shake off. Pentaho's vision is to create a single, consistent experience across the entire data pipeline, and although only a point release, Pentaho 7.1 is a major step towards that goal of a consolidated approach with a fully featured stack; the major differentiator is that the complexity of executing on that stack is not passed on to the end user. This matters because there is a startling lack of people to fulfil the much-lauded role of data scientist, and without products such as Pentaho that shortage will lead to higher costs, slower execution of plans and loss of competitive advantage.
So why did I say this point release sparked my interest? Pentaho now supports Spark, which is emerging as the preeminent engine for large-scale data processing, from within its visual drag-and-drop environment. Crucially, it does so not via Spark-specific data integration logic, which often requires those expensive Java skills that are in such short supply, but by decoupling the logic of what is required to manipulate the data for analysis from the engine that performs it. Because the execution engine is decoupled from the logic, the application logic can be simplified and applied to whichever engine suits best. You start simple and only add complexity where you must: if what is delivered is fast enough, you can stop; if it is not, you can add a new engine and exploit the latest advances, but only where it is needed, not right across the board. The goal is the merging of data discovery, data integration and the BI exploitation of the resultant data sets in a single tool set, not in a hybrid integration, and Pentaho have taken a giant leap ahead of the market in this release. You build once and execute on any engine; initially that means Spark and the classic Pentaho engine, but the architecture is there to allow run-time selection of whichever engine offers the best fit. Pentaho call this the Adaptive Execution Layer, and I think we will hear a lot more about it, because it addresses not just a technical need but, more importantly, a business one: how to release the talents of analysts, business domain experts and data scientists working together as a team to be productive, focused and able to deliver meaningful results faster.
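To make the decoupling idea concrete, here is a minimal conceptual sketch, in Python, of the pattern behind an adaptive execution layer. This is not Pentaho's actual API: the class and function names (`Engine`, `LocalEngine`, `SimulatedDistributedEngine`, `build_pipeline`) are hypothetical, and the "distributed" engine is only simulated. The point it illustrates is that the transformation logic is declared once and the engine that runs it is chosen at run time.

```python
# Conceptual sketch only (hypothetical names, not Pentaho's API):
# pipeline logic is defined once; the executing engine is swappable.
from abc import ABC, abstractmethod


class Engine(ABC):
    """Anything that can run a list of row-transformation steps."""
    @abstractmethod
    def run(self, steps, rows):
        ...


class LocalEngine(Engine):
    """Classic single-process engine: apply each step row by row."""
    def run(self, steps, rows):
        for step in steps:
            rows = [step(row) for row in rows]
        return rows


class SimulatedDistributedEngine(Engine):
    """Stand-in for a Spark-like engine. A real one would map the same
    steps over a distributed data set; the logic itself is unchanged."""
    def run(self, steps, rows):
        for step in steps:
            rows = list(map(step, rows))
        return rows


def build_pipeline():
    # Transformation logic defined once, independent of any engine.
    return [
        lambda row: {**row, "total": row["qty"] * row["price"]},
        lambda row: {**row, "total": round(row["total"], 2)},
    ]


def execute(engine: Engine, rows):
    return engine.run(build_pipeline(), rows)


rows = [{"qty": 3, "price": 1.05}]
local_result = execute(LocalEngine(), rows)
distributed_result = execute(SimulatedDistributedEngine(), rows)
```

Swapping the engine changes only how the work is executed, never what the pipeline does, which is why the same design lets a tool start on a simple engine and move to Spark only when scale demands it.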
They have also added enhanced data visualisations across the pipeline, so that during data preparation users can spot-check data for quality issues and try out prototypes as they build, all without having to switch in and out of the development environment. Again, this is a very neat technical solution to a very real business issue. As someone who combines looking at products for Bloor with working on real-world projects, I know only too well how the business loses faith in IT when we give them a date, do everything in our power to hit it, and then deliver a dud: the system works, but the data quality is so poor that it masks what we are seeking to deliver. Pentaho 7.1 users can use heat grids, geo maps and sunbursts to look at the data as the build advances, spot where the issues lie, and then drill down to explore the data in detail. This should enable data issues to be spotted during the build and eliminated before we reach the release date instead of after!
Many people are concerned about the lack of comprehensive security and authentication within these Big Data environments, and Pentaho is addressing that as well. This release extends support for the Knox and Ranger elements of the Hortonworks security stack, as well as offering Kerberos authentication and authorisation to protect clusters from intrusion. The environment can therefore be protected against unauthorised access, safeguarding vital data and dramatically reducing the risk of exposure.
There is further support for cloud adoption too, but I think the above gives a good flavour of why this release is such a big step forward, and why Pentaho continues to demand to be taken very seriously when choosing the BI stack with which to exploit your ever-growing data in the search for efficiencies and competitive advantage.