CrateDB

Update solution on February 28, 2020

CrateDB, which was first released in December 2016, is a NewSQL database. That is, it is a distributed, lock-free, ANSI standard SQL database (accessed via the PostgreSQL wire protocol), with a fully searchable document data store (based on Apache Lucene) underpinning the database management system. It is open source, written in Java, and based on a shared-nothing architecture.

The product is effectively multi-model: it supports JSON documents, relational data, geo-spatial data, full text and binary large objects (BLOBs). It also offers time-series capabilities through the database’s ability to partition data by any function, including functions of time. While time-series capabilities are supported, they are not yet as advanced as Crate.io would like. The company has, for example, added time-specific window functions and support for ingestion from Telegraf and Prometheus, but it has further work to do in this regard.
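To make the idea of partitioning by a function concrete, the following is a minimal Python sketch of the concept. It is not CrateDB code and says nothing about how CrateDB implements partitioning internally. It groups readings by the result of an arbitrary partition function, with truncation of a timestamp to its month as the time-based case. The names `partition_by` and `month_of`, and the sample readings, are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def month_of(reading):
    """Hypothetical partition function: truncate the timestamp to its month."""
    ts = reading["ts"]
    return datetime(ts.year, ts.month, 1, tzinfo=ts.tzinfo)


def partition_by(readings, key_fn):
    """Group readings into partitions keyed by key_fn(reading)."""
    partitions = defaultdict(list)
    for reading in readings:
        partitions[key_fn(reading)].append(reading)
    return dict(partitions)


# Illustrative sample data, not taken from any real deployment.
readings = [
    {"sensor": "a", "ts": datetime(2020, 1, 15, 8, 0, tzinfo=timezone.utc), "value": 20.5},
    {"sensor": "b", "ts": datetime(2020, 1, 31, 23, 59, tzinfo=timezone.utc), "value": 21.0},
    {"sensor": "a", "ts": datetime(2020, 2, 1, 0, 0, tzinfo=timezone.utc), "value": 19.8},
]

# Time is just one possible partition function; any function of the row works.
by_month = partition_by(readings, month_of)
by_sensor = partition_by(readings, lambda r: r["sensor"])
```

In this sketch, a query restricted to January 2020 would only need to look at one of the two monthly partitions, which is the general motivation for partitioning time-series data by a function of time.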

CrateDB is targeted specifically at IoT applications, and at ingesting and processing machine data, and it can be deployed both at the edge and more centrally. However, the database is not ACID compliant (it offers eventual consistency), so it is suitable for hybrid environments that support real-time operational and analytic processing, but not for transactional processing in the traditional sense of that term.

Customer Quotes

“Thousands of sensors generate data along our production lines, and CrateDB allows us to analyze that firehose of data 24 hours a day to make real-time improvements to factory efficiency.”
Alpla

“Dealing with sensor data CrateDB is the only database that gives us the speed, scalability and ease of use that our teams, customers and applications require.”
Ganter Instruments

The architecture of CrateDB is illustrated in Figure 1. As can be seen, it has a master-less architecture (all nodes are equal), which means that there is no single point of failure. What is not shown in this diagram is that the database uses columnar caches for real-time SQL processing, with indexes stored in memory. These indexes distinguish between datatypes, with specific capabilities for numeric, full-text and geo-spatial data, and so on. While CrateDB is not an in-memory database per se, the company argues that its architecture offers most of the performance benefits of an in-memory database, but without the constraints that an in-memory database imposes.

Figure 1

In this context, note that in IoT environments it is often a requirement to analyse not just data streaming into the CrateDB environment (via Kafka, Flink, StreamSets, Spark or MQTT) but also large historic datasets. In-memory databases get expensive when this is the case, whereas CrateDB supports conventional storage for this purpose, with historic data typically persisted to SSDs.

Replicas are used for fault tolerance and to assist with performance. For distributed reads, there is no distinction between reading from the primary shard and reading from any of the replicas. Writes, on the other hand, are synchronous over all active replicas. We should add, however, that the emphasis in CrateDB is on low-latency, real-time processing, and the product is not ideally suited to supporting long-running batch queries (there are no workload isolation capabilities).
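The read and write behaviour of replicas described above can be illustrated with a toy Python model. This is a sketch under simplifying assumptions, not CrateDB's implementation: it ignores networking, failure handling, recovery and versioning, and the class names `ShardCopy` and `ReplicatedShard` are hypothetical.

```python
import random


class ShardCopy:
    """One copy of a shard: either the primary or a replica."""

    def __init__(self, name):
        self.name = name
        self.active = True
        self.rows = {}


class ReplicatedShard:
    """Toy model of a shard with one primary and a number of replicas."""

    def __init__(self, replica_count):
        self.copies = [ShardCopy("primary")] + [
            ShardCopy(f"replica-{i}") for i in range(replica_count)
        ]

    def active_copies(self):
        return [copy for copy in self.copies if copy.active]

    def write(self, key, value):
        """Apply the write to every active copy before acknowledging it."""
        targets = self.active_copies()
        for copy in targets:
            copy.rows[key] = value
        return len(targets)  # number of copies that hold the write

    def read(self, key):
        """Serve the read from any active copy, primary or replica alike."""
        copy = random.choice(self.active_copies())
        return copy.rows.get(key)


shard = ReplicatedShard(replica_count=2)
acked_by = shard.write("sensor-1", 42.0)
value = shard.read("sensor-1")
```

Because the write is applied to all active copies before it returns, the subsequent read gives the same answer whichever copy happens to serve it, which is why reads can be spread across primary and replicas to help performance.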

Finally, the product has significant capabilities for supporting machine learning, and supports R, Scala, Python, TensorFlow, Spark and Jupyter Notebooks. The fact that the product uses a standard SQL interface means that users should be able to deploy their preferred business intelligence tools in conjunction with CrateDB.

As we have noted, CrateDB is focused on a particular market segment. However, this is not a niche market but a large and growing sector. For example, industrial IoT environments have very demanding requirements when you want to monitor, predict and control such things as smart factories or smart buildings, and meeting them often requires a combination of technologies: see Figure 2. In practice, the company finds that CrateDB is often deployed as a replacement for combinations of existing databases, and this approach can lead not only to improved performance and scale but also to significant cost savings. For example, McAfee, when it first implemented CrateDB internally, replaced forty pre-existing machines with ten, though it has since expanded this to one hundred and fifty.

Figure 2

The Bottom Line

CrateDB is deservedly gaining significant success within its target market. Apart from the inherent performance and scalability offered by the product, the combination of time-series and geo-spatial data, along with support for sophisticated analytic functions, is relatively rare. For these reasons CrateDB merits serious consideration.
