VMware Greenplum - A Powerful Data Warehouse for Large-Scale Analytics

Large organizations need to answer many questions to effectively run and optimise their business.

What are our best-selling products by market?
Who are our most profitable customers?
What are our most profitable marketing channels and products?
How successful was our latest marketing promotion campaign?

These questions are easy to pose but hard to answer if you are a global company with dozens of subsidiaries. Why? Because the data about those customers, products and channels is spread across numerous systems and data types.

In the real world, this data may become duplicated, out-of-date, or incomplete across the many systems. Likewise, this data may be classified in different ways for legitimate business reasons, so assembling it in a consistent form is no trivial task. Furthermore, applying cost allocation rules to customers, channels or products to determine profitability usually requires complex calculations. These common needs and challenges are what led to the development of the data warehouse, which enables businesses to perform analytics on their data collections and provide consistent summaries from the various sources across the enterprise.

As these data collections continue to grow, analytical queries become slower and costlier to run on standard relational (SQL-based) databases. Beyond a certain size (perhaps hundreds of terabytes or even multiple petabytes), the performance of conventional SQL databases degrades, and some complex queries may take an unacceptably long time to complete.

Slow query speed is especially an issue if a large number of queries need to run simultaneously, competing for limited compute resources. This general issue led to the development of massively parallel processing (MPP) data warehouses, where very large database queries are executed by co-ordinating work across multiple processors (and their associated memory). Each query is split up into smaller components and spread across dozens (or even hundreds) of processors, without requiring any extra programming from the person writing the query.
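The splitting described above can be illustrated with a small sketch in Python. It is a toy model only, not how Greenplum itself is implemented: the sales rows, the round-robin distribution and all function names are hypothetical, and threads stand in for the separate processors of an MPP system. Each "segment" computes partial totals over its own slice of the data, and the partial results are then merged into one answer.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales rows: (market, product, units_sold).
SALES = [
    ("EU", "widget", 5),
    ("US", "widget", 7),
    ("EU", "gadget", 3),
    ("US", "gadget", 2),
    ("EU", "widget", 4),
    ("US", "gadget", 6),
]


def distribute(rows, num_segments):
    """Spread rows across segments round-robin (an assumed, simplified policy)."""
    segments = [[] for _ in range(num_segments)]
    for i, row in enumerate(rows):
        segments[i % num_segments].append(row)
    return segments


def partial_totals(segment_rows):
    """Work done independently on one segment: sum units per (market, product)."""
    totals = Counter()
    for market, product, units in segment_rows:
        totals[(market, product)] += units
    return totals


def parallel_units_by_market_product(rows, num_segments=3):
    """Scatter the aggregation to every segment, then merge the partial results."""
    segments = distribute(rows, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(partial_totals, segments))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return dict(merged)


if __name__ == "__main__":
    print(parallel_units_by_market_product(SALES))
```

The caller asks one question ("units sold by market and product") and never sees the split: the answer is the same whether the work runs on one segment or many, which is the sense in which no programming intervention is needed.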

There are several MPP data warehouse options on the market, but Greenplum is the only one based on open-source code. This well-established product has hundreds of large customers such as Dell and Purdue University, who typically use Greenplum to perform business-critical data analytics on very large and disparate datasets.

Greenplum supports complex queries that may involve in-database graph functions, textual data, or geospatial data, in addition to regular structured data such as sales figures and costs. One Greenplum customer runs ten million separate SQL queries per day, demonstrating the product’s high scalability to satisfy the most demanding of use cases. Greenplum can be deployed on-premises, on a VMware Private Cloud or in the public cloud.

Given its high performance, scalability, and broad range of deployment options, VMware Greenplum should be on your shortlist if you have a very large data warehouse (especially over a hundred terabytes) with demanding analytical requirements.