Analyst Coverage: Philip Howard
Starburst Data offers commercial support for the open source Presto project. Originally that support, along with contributions to the Presto project, was provided by Teradata. However, in late 2017 Teradata spun out Starburst Data as an independent company. The company is backed by venture capital and has recently announced $42m of Series B funding. Its headquarters are in Boston and, as well as other offices in the United States, it has a European office in Warsaw.
Last Updated: 1st July 2020
Presto is an open source distributed “SQL on Anything” engine for running interactive analytic queries. It has the ANSI standard SQL engine you would expect from a database. It doesn’t include its own storage mechanisms, but it allows you to query data wherever it is stored, be it in distributed storage or in a database, fully separating compute from storage. This reflects the current trend towards such a separation. The corollary is that you can use whatever storage engine (see Figure 1), or combination of storage engines, is suitable for your application. The company tells us that it takes between one and three months to support additional storage options such as Db2, Greenplum or Vertica. It is continuously working with the open source community, as well as its customers, to add new connectors based on demand.
This approach means that you can scale compute separately – there is an autoscaling feature – from your storage requirements. You can also use the front-end business intelligence tool of your choice. In turn, this means that Starburst Enterprise Presto is most commonly deployed to support query federation across multiple data sources.
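To make query federation more concrete: Presto addresses each table as catalog.schema.table, so a single SQL statement can join tables that live in different systems. The following is a minimal sketch in Python that only builds such a statement. The catalog, schema, table and column names are hypothetical, and submitting the query to a cluster is deliberately left out.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified table name in catalog.schema.table form."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql() -> str:
    """Build one SQL statement that joins tables from two different catalogs."""
    # Hypothetical catalog backed by files in distributed storage.
    clicks = qualified("hive", "web", "clicks")
    # Hypothetical catalog backed by a relational database.
    customers = qualified("postgresql", "crm", "customers")
    return (
        "SELECT c.region, count(*) AS click_count\n"
        f"FROM {clicks} AS k\n"
        f"JOIN {customers} AS c ON k.customer_id = c.id\n"
        "GROUP BY c.region"
    )


if __name__ == "__main__":
    print(federated_join_sql())
```

The point of the sketch is that the join is expressed once, in ordinary SQL, and the engine rather than the user is responsible for reaching into each source.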
Presto is available under an Apache license, for which Starburst provides commercial support, as well as offering Starburst Enterprise Presto. The company is a major contributor to the Presto project; in fact, the founders of Presto are also founders of Starburst. Other contributors include companies such as Facebook (which developed Presto in the first place), Slack, Grubhub, Comcast, and FINRA. The product may be deployed in the cloud, on-premises, or in hybrid environments.
“Presto is amazing. Our lead engineer got it into production in just a few days. It’s an order of magnitude faster than Hive in most of our use cases. It reads directly from HDFS, so unlike Redshift, there isn’t a lot of ETL before you can use it. It just works.”
“FINRA monitors market data for trading fraud. Starburst Presto separates compute and storage, making it possible to scale economically and analyze 25PB of data – 100B rows of new data per day from
Presto is a massively parallel distributed system that runs on a cluster of machines. A full installation includes a coordinator (which enables high availability) and multiple workers, as illustrated in Figure 2. Queries are submitted from a client such as the Presto CLI (command line interface) to the coordinator. The coordinator parses, analyses and plans the query execution, then distributes the processing to the workers. Specialised connectors are available for Cassandra, MySQL, Google BigQuery, ElasticSearch, Oracle, MongoDB, Snowflake, PostgreSQL and many others, and there is also ODBC and JDBC support. There are Presto client libraries that support C, Go, Java, Node.js, PHP, Python, R and Ruby. Also notable are the in-memory capabilities, the use of vectorised columnar processing and integration with Kubernetes, which allows deployment on any cloud and on-premises.
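The coordinator and worker roles described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Presto’s actual implementation: threads stand in for worker nodes, a Python list stands in for the data, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, n_workers):
    """Coordinator step: divide the input into one split per worker."""
    return [rows[i::n_workers] for i in range(n_workers)]


def worker_partial_sum(split):
    """Worker step: compute a partial aggregate over its own split only."""
    return sum(split)


def run_query(rows, n_workers=4):
    """Coordinator: plan, distribute splits to workers, then merge the partials."""
    splits = plan_splits(rows, n_workers)
    # Threads stand in for separate worker machines in this toy model.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    # Final merge of the partial results happens back at the coordinator.
    return sum(partials)


if __name__ == "__main__":
    # Loosely equivalent to a SELECT sum(x) over the values 1 to 100.
    print(run_query(list(range(1, 101))))
```

The same shape (plan, fan out, merge) is what allows compute to be scaled by adding workers without touching where the data is stored.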
The product does not currently support push-down query capability, but the company intends to introduce this in 2020. This will be two-way, in the sense that queries will be pushed down when that is appropriate but not when the source database is overworked.
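As an illustration of what such a two-way decision might look like, here is a minimal sketch. Everything in it is an assumption made for the example: the load metric, the 0.8 threshold and the function names are hypothetical and do not describe how Starburst implements push-down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceStatus:
    """Hypothetical snapshot of a source database's state."""
    supports_pushdown: bool
    load: float  # hypothetical utilisation metric between 0.0 and 1.0


@dataclass
class FilterPlan:
    """Where a filter predicate will be evaluated."""
    source_filter: Optional[str]  # evaluated by the source database
    engine_filter: Optional[str]  # evaluated by the query engine


def should_push_down(status: SourceStatus, max_load: float = 0.8) -> bool:
    """Push work to the source only if it can accept it and is not overworked."""
    return status.supports_pushdown and status.load < max_load


def plan_filter(predicate: str, status: SourceStatus) -> FilterPlan:
    """Decide whether the predicate runs at the source or in the engine."""
    if should_push_down(status):
        return FilterPlan(source_filter=predicate, engine_filter=None)
    return FilterPlan(source_filter=None, engine_filter=predicate)


if __name__ == "__main__":
    print(plan_filter("order_date >= DATE '2020-01-01'", SourceStatus(True, 0.3)))
    print(plan_filter("order_date >= DATE '2020-01-01'", SourceStatus(True, 0.95)))
```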
A major feature of Starburst Enterprise Presto is that it offers a cost-based optimiser that is the result of a collaboration between what is now Starburst and Facebook, as opposed to the less capable optimiser used in standard Presto distributions. It has been designed specifically for Presto, as opposed to the Apache Calcite project, which is more of a generic optimiser. Another major feature that was previously contributed by Teradata is spill-to-disk, which is designed to support query processing when you run out of memory. There are a number of other in-memory engines which grind to a halt if you run out of memory. Workload management capabilities are provided along with resource groups.
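The idea behind spill-to-disk can be shown with a toy grouped count. This is a simplified sketch, not the Presto implementation: the memory budget is expressed as a number of distinct keys, keys are assumed to be strings, and the final merge is done in memory for brevity where a real engine would merge the spilled data incrementally.

```python
import json
import tempfile
from collections import Counter


def count_with_spill(keys, max_keys_in_memory=1000):
    """Count occurrences per key, spilling partial counts to disk whenever the
    in-memory table exceeds max_keys_in_memory distinct keys.

    Returns the final counts and the number of spills that occurred.
    """
    in_memory = Counter()
    spill_files = []
    for key in keys:
        in_memory[key] += 1
        if len(in_memory) > max_keys_in_memory:
            # Budget exceeded: write the partial result to a temporary file
            # and start again with an empty in-memory table.
            spill = tempfile.TemporaryFile(mode="w+")
            json.dump(in_memory, spill)
            spill_files.append(spill)
            in_memory = Counter()
    # Merge phase: fold each spilled partial result into the final counts.
    total = Counter(in_memory)
    for spill in spill_files:
        spill.seek(0)
        total.update(json.load(spill))
        spill.close()
    return dict(total), len(spill_files)


if __name__ == "__main__":
    counts, spills = count_with_spill(["a", "b", "c", "a", "b", "a"], max_keys_in_memory=2)
    print(counts, spills)
```

The query completes with the correct result either way; the budget only determines how often intermediate state is written out rather than whether the query can finish.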
The product has strong security capabilities, with support for LDAP and Kerberos, and you can inherit security details from the storage environment. In addition, Starburst ensures Presto security & governance with role-based access control, data masking and encryption (both at rest and in motion), column and row level security, and integration with Apache Ranger. And finally, the company has recently introduced Starburst Mission Control as a management console to manage Starburst Enterprise Presto clusters across platforms and data sources. It allows you to create, access, and manage multiple clusters, even across hybrid cloud environments, from a single intuitive user interface.
It is currently available on AWS and Kubernetes, which covers both cloud and on-premises deployments.
There are two questions to answer. Firstly, why choose Presto? And secondly, why prefer Starburst Enterprise Presto to other versions of Presto? In the first case, the ability to scale storage and compute separately is a major benefit, as is the ability to have heterogeneous storage engines, with built-in query federation. Also relevant is that there is no vendor lock-in: if you want to change your storage engine then Presto can accommodate that.
As far as Starburst is concerned, the key reasons for adopting this version of Presto are exactly the same as apply to other commercially supported open source products: you get support, high availability, enterprise connectors, security and the latest performance improvements.
The Bottom Line
Starburst was created because the founders believed that there was a large market opportunity to create an enterprise-grade version of Presto. This is undoubtedly true. However, it is worth bearing in mind that Starburst has its origins in Teradata: a company that has had decades of experience in optimising analytic performance. This experience is evident in the various Starburst Enterprise Presto offerings.