Mixed query workloads – why is it important?

Within the data warehousing community there has been a lot of talk over the last few months (not to mention relevant announcements from Teradata, Dataupia, IBM and others) about mixed query workloads. In this, the first of three articles on the subject, I will discuss why it is important. In the following two articles I will consider why it is a problem and what sort of features one should expect from a vendor offering such capability.

So, why is support for mixed query workloads important? Well, first we had better be clear about what we mean by a mixed query workload. Fortunately, unlike a lot of jargon, this term is fairly self-descriptive. It relates to how you balance different demands upon the data warehouse and, particularly, how you balance short and long running queries. The assumption is that short running queries really have to be short, and that they cannot be delayed or impeded by any long running analytics or data mining tasks that are running at the same time.

This is perhaps best illustrated by example. Suppose you are a Telco. One of the main sorts of data that you want to analyse is customer data because you want to understand such things as propensity to churn, customer lifetime value and so on. At the same time you are also likely to want to do conventional slice and dice type things that look at customers by region and so on, and you may also want to do some more conventional data mining. If you undertake such things as customer satisfaction surveys then you may also want to do text mining against that data. And, if you are storing call detail records then you will want to do network analysis against this data (what traffic was routed through this switch and so forth) for capacity planning reasons. And, of course, you need to be able to provide search capabilities against the call detail records.

Now consider specifically the questions about propensity to churn and customer lifetime value. This is precisely the information that a call centre operative will need to have available when dealing with a customer on the phone. Not that he or she needs the figures per se; rather, the operator will need a script derived from these figures so that the customer can be made an appropriate offer. But in the call centre you can’t hang around waiting for the response to your query about this customer: you need the information immediately. This is what I mean by short running queries having to be short.

Of course, this is not limited to Telcos. You can’t have a real-time dashboard, or a real-time business process that has an enquiry embedded within it, waiting five minutes for a response to a query. In particular, look-up queries of all sorts are essentially short running queries. And, while on this subject, master data management, if used to synchronise operational applications, has the same requirements (which is a good argument for not hosting it in the warehouse since it is transactional in nature).

Data warehousing environments are increasingly required to host these short running queries in addition to their traditional role of supporting analytics and business intelligence. These different types of queries have different requirements: it is the function of mixed query workload management to ensure that the long running analytic queries do not impinge on the performance requirements of the short running ones.