Analyst Coverage: Philip Howard
Yellowbrick Data was established in 2014 and the Yellowbrick Data Warehouse first appeared in 2016 with general availability in 2017, though the company did not really emerge from stealth mode until the following year. The company, which is backed by venture capital, has its headquarters in the United States with additional offices in London and Hong Kong.
Last Updated: 1st July 2020
Yellowbrick Data Warehouse is a massively parallel data warehouse available either on-premises as an appliance or as a multi-cloud (Amazon, Azure and Google) Cloud Data Warehouse, which provides the Yellowbrick environment as a (hybrid cloud) managed service. The latter is the company’s primary focus: it combines the benefits of managed cloud services, including disaster recovery, with the performance gains offered by the hardware architecture. The warehouse can also be implemented on a private cloud if required.
The company’s products are targeted at traditional enterprise data warehouses with, as of January 2020, a maximum size of 3.5PB. Because IBM began withdrawing support for Netezza in June 2019, Yellowbrick is also targeting Netezza replacements, not least because Yellowbrick, like Netezza, is based on PostgreSQL. The company has also added a library of functions specifically tailored to provide Netezza compatibility.
“In our testing of Yellowbrick, we compared the performance of a six-rack (Netezza) TwinFin to the six-U (30cm high) baseline Yellowbrick system. And performance was anywhere from 3 to 50 to 100 times faster.”
The fundamental principle behind Yellowbrick’s thinking is that traditional data warehousing architecture with spinning disks is simply old-fashioned. Its view is that even more modern, in-memory based systems with flash disks simply move the processing bottleneck from disk to memory. In these architectures, incoming data goes to memory which, in turn, leverages, or tries to leverage, CPU cache. Yellowbrick argues that this is the wrong way around: it is better for data to be processed directly in CPU cache (L1, L2 and L3, with L3 in the first instance), with CPU cache and memory-based capabilities interacting, as illustrated in Figure 1.
The company also argues that the current trend towards separating compute from storage is an illusion. Certainly, there are environments without much seasonality, where workloads are more or less consistent, in which separation brings no advantage. More specifically, Yellowbrick concedes that separation may be appropriate for some smaller environments and that it may have cost advantages when you can scale compute power up and down. However, its view is that the interconnect is typically too slow and that the time needed to warm up caches impairs performance. There is some truth in this argument, and the desirability of separating compute from storage is not as clear-cut as some vendors might have you believe. Indeed, even some of the suppliers that offer this approach do so only as an option.
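The interconnect and cache warm-up argument can be made concrete with a back-of-envelope model. The following is a minimal sketch only: the function names and every figure passed to them are hypothetical and are not taken from Yellowbrick or any other vendor. It simply assumes that data served from a warm local cache moves at one rate and data fetched over the interconnect moves at a slower one.

```python
def warmup_seconds(cache_bytes: float, interconnect_bytes_per_s: float) -> float:
    """Time to refill a cold cache of the given size over the interconnect."""
    if interconnect_bytes_per_s <= 0:
        raise ValueError("interconnect bandwidth must be positive")
    return cache_bytes / interconnect_bytes_per_s


def scan_seconds(
    data_bytes: float,
    local_bytes_per_s: float,
    interconnect_bytes_per_s: float,
    hit_ratio: float,
) -> float:
    """Scan time when a fraction hit_ratio of the data is already cached locally.

    The cached fraction is read at local speed; the remainder has to be
    pulled across the (hypothetically slower) interconnect.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    if local_bytes_per_s <= 0 or interconnect_bytes_per_s <= 0:
        raise ValueError("bandwidths must be positive")
    local_part = data_bytes * hit_ratio / local_bytes_per_s
    remote_part = data_bytes * (1.0 - hit_ratio) / interconnect_bytes_per_s
    return local_part + remote_part


if __name__ == "__main__":
    # Illustrative, made-up figures: 1 TB scanned, 10 GB/s local, 1 GB/s interconnect.
    for ratio in (0.0, 0.5, 1.0):
        print(ratio, scan_seconds(1e12, 1e10, 1e9, ratio))
```

Under these assumed figures a fully cold cache makes the scan ten times slower than a fully warm one, which is the shape of the penalty Yellowbrick is describing. Whether the penalty matters in practice depends on how often compute nodes are started cold and on the real bandwidths involved.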
More generally, the broader architecture of Yellowbrick is illustrated in Figure 2, with notable features including parallel loaders, a fast row store as well as columnar storage, a cost-based optimiser, workload management, a system management console and a customised (vectorised) SQL processor. Note that the last of these replaces the standard PostgreSQL processor, which Yellowbrick does not believe is fast enough. In the same context it is also worth commenting that, because the product is built on top of PostgreSQL, you should be able to leverage the PostgreSQL extensions supporting geo-spatial and time series data, which will be important in Internet of Things environments. Not shown in this diagram is the fact that Yellowbrick offers asynchronous replication across Yellowbrick instances, regardless of whether these are on-premises or in the cloud.
Not everybody loves appliances. They are certainly convenient for on-premises deployments and modern appliances are far more flexible than used to be the case. However, old prejudices die hard. Nevertheless, there is no question that purpose-built hardware, working in conjunction with software built for that hardware, can deliver the best possible performance. In addition, Yellowbrick is available as a managed service in the cloud, and most people don’t (and shouldn’t) care what hardware they are running in such an environment. What they will care about is the cost and, at least in theory, Yellowbrick should have a significant advantage in this regard, since the use of optimised hardware should offer superior price/performance. As an example, one company that is now a client ran a proof of concept comparing Yellowbrick with Google BigQuery as an in-cloud analytics platform, finding that the former was six times faster than the latter.
One feature we would like to see is support for external tables (or similar) to allow federated queries between Yellowbrick and other environments such as Amazon S3 or Hadoop. Of course, there are third-party tools that support this but other data warehousing vendors are increasingly building this sort of facility into their offerings.
The Bottom Line
Appliances have largely fallen out of fashion, even if they are convenient and speed up time to value. So, we expect Yellowbrick to meet some resistance on this front. However, that should not apply to either in-cloud or hybrid cloud deployments, where Yellowbrick can provide demonstrably better performance along with all the benefits offered by a managed service.