What exactly is in-memory?

“In-memory” is becoming increasingly popular within the data warehousing community. The cynic in me expects that traditional caching technology will soon be re-branded by marketing as “in-memory”. Strictly speaking that is what it is, but it is not what people have come to expect from in-memory technologies. So, what exactly do people expect? Well, that depends: there are lots of different ways to accelerate analytic performance through the use of in-memory techniques.

The first is traditional caching. This typically holds the results of queries or sub-queries that you have just run, on the basis that further queries in the near future are likely to need the same or similar results, which can then be reused from the cache rather than recomputed. Intelligent caching options will ensure that you only cache results that are genuinely likely to be reused.
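To make the idea of "intelligent" caching concrete, here is a minimal sketch in Python. It is not any vendor's implementation: the ResultCache class, its admit_after threshold and the fake_execute function are all hypothetical names invented for illustration. The assumption is a simple policy in which a result is only admitted to the cache once the same query has been seen twice, with the least recently used entry evicted when the cache is full.

```python
from collections import Counter, OrderedDict


class ResultCache:
    """Hypothetical LRU result cache that only admits queries that repeat."""

    def __init__(self, capacity=2, admit_after=2):
        self.capacity = capacity        # maximum number of cached results
        self.admit_after = admit_after  # executions needed before caching
        self.seen = Counter()           # how often each query has been executed
        self.results = OrderedDict()    # cached results, oldest first
        self.hits = 0
        self.misses = 0

    def run(self, sql, execute):
        # Crude normalisation so trivially different spellings share an entry.
        key = " ".join(sql.lower().split())
        if key in self.results:
            self.results.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.results[key]
        self.misses += 1
        result = execute(sql)
        self.seen[key] += 1
        if self.seen[key] >= self.admit_after:
            self.results[key] = result
            if len(self.results) > self.capacity:
                self.results.popitem(last=False)  # evict least recently used
        return result


# Stand-in for a real database call: it records each execution.
executed = []

def fake_execute(sql):
    executed.append(sql)
    return len(executed)

cache = ResultCache()
first = cache.run("SELECT 1", fake_execute)    # miss: seen once, not cached
second = cache.run("select  1", fake_execute)  # miss: seen twice, now cached
third = cache.run("SELECT 1", fake_execute)    # hit: served from the cache
```

In this sketch the third call never reaches the database. A real product would also have to invalidate cached results when the underlying data changes, which is omitted here.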

Going beyond caching, a good use of in-memory technology is to hold indexes. These then don’t have to be read from disk as part of query processing, which speeds up analytics whenever the use of indexes is appropriate. Of course, indexes can get very large, so it will help if index compression is supported. If there is not enough memory to hold all of your indexes, you may also need facilities that determine which indexes are held in-memory and which are not, not least because once indexes are held in memory there will be a tendency to spawn more and more of them. Also, not all queries will use indexes even if they are available (whole table scans, for example), so this is by no means a panacea.

Not all data warehousing vendors use indexes, however, in which case in-memory technology can be used for something else, typically metadata-based acceleration. Companies like Netezza (IBM) with its zone maps and Infobright with its Knowledge Grid improve performance through the use of metadata, and holding this metadata in memory can improve performance still further.
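As an illustration of metadata-based acceleration in general, the following Python sketch keeps a small in-memory summary (the minimum and maximum value) for each block of data and uses it to skip blocks that cannot satisfy a range query. This is a generic sketch of the idea, not a description of how Netezza or Infobright actually implement their products; the function names and the sample data are assumptions made up for the example.

```python
def build_zone_map(blocks):
    """Record the (min, max) of each block; this small summary lives in memory."""
    return [(min(block), max(block)) for block in blocks]


def scan_range(blocks, zone_map, low, high):
    """Return values in [low, high], reading only blocks that might contain them."""
    matches = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < low or block_min > high:
            continue  # the metadata proves this block has no matching rows
        blocks_read += 1  # in a real system this would be a disk read
        matches.extend(value for value in block if low <= value <= high)
    return matches, blocks_read


# Four "disk blocks" of a single column (made-up values).
blocks = [[1, 4, 7], [10, 12, 19], [20, 25, 31], [40, 41, 48]]
zone_map = build_zone_map(blocks)
matches, blocks_read = scan_range(blocks, zone_map, 11, 22)
```

Here only two of the four blocks are read to answer the query. The benefit depends heavily on how the data is ordered: if values are scattered randomly across blocks, every block's range will overlap the query and nothing can be skipped.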

The third option is illustrated in the recent joint announcement by Teradata and SAS of the “Teradata Appliance for SAS High-Performance Analytics.” This too uses in-memory technology, but the best way to think of it is as an extension to in-database analytics. In-database analytics allows analytic applications, statistical functions and data mining to be performed inside the data warehouse without having to extract the data. It is therefore faster and allows you to run against the entire dataset rather than having to sample it. This appliance makes that environment even faster through the use of in-memory techniques. It is interesting to note that when I recently wrote an article about this release, SAS put the emphasis not so much on faster queries (though they no doubt are faster) as on faster analytical data preparation and model development. So in-memory technology is not just about queries.

Next there are in-memory databases. These have been around for years and both IBM and Oracle, as examples, have successfully used solidDB and TimesTen as front-end caches for DB2 and the Oracle Database respectively. However, the world has moved on and now we have the likes of Oracle Exalytics and SAP HANA. These are both based around the idea of doing everything in memory. This is fine if you have enough memory, and you should get great performance. If you don’t, you have two choices: either determine which data is “hot” and put that in-memory, accepting that you will get poorer performance for colder data, or add one or more additional servers.
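The hot versus cold decision can be sketched as a simple placement problem. The Python below is a hypothetical illustration only: the table names, sizes and read counts are invented, and ranking tables by reads per gigabyte and filling memory greedily is just one plausible policy, not how Exalytics or HANA decide what to keep in memory.

```python
def choose_hot_set(tables, memory_gb):
    """Greedily pick the tables with the most reads per GB that fit in memory."""
    ranked = sorted(
        tables,
        key=lambda table: table["reads"] / table["size_gb"],
        reverse=True,
    )
    hot, used_gb = [], 0
    for table in ranked:
        if used_gb + table["size_gb"] <= memory_gb:
            hot.append(table["name"])
            used_gb += table["size_gb"]
    return hot, used_gb


# Invented workload statistics for illustration.
tables = [
    {"name": "sales_2012", "size_gb": 300, "reads": 9000},
    {"name": "customers", "size_gb": 50, "reads": 4000},
    {"name": "sales_archive", "size_gb": 2000, "reads": 500},
    {"name": "products", "size_gb": 10, "reads": 1500},
]
hot, used_gb = choose_hot_set(tables, memory_gb=512)
```

With 512 GB available, this policy keeps the three frequently read tables in memory and leaves the large, rarely read archive on disk, which is exactly the trade-off described above: queries against the archive will be slower.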

In practice, of course, you are not going to get a whole data warehouse into memory unless you have very little data or a lot of money. These in-memory databases are really for smaller data marts (up to a few terabytes), to support specific analytic applications or to accelerate a “hot” part of a broader data warehouse. The one exception to this rule is uRiKA, the graph database from YarcData, where the whole thing is predicated on the assumption that everything is in-memory and there is no such thing as cold data.

Anyway, the point is that “in-memory” can span a lot of different capabilities, no doubt including things I have not mentioned. It will be worth bearing this in mind when vendors talk to you about their in-memory capabilities, particularly if “in-memory” becomes a fashionable word in the hype vocabulary.