In Barcelona for the Intel Software Conference

I’ve been speaking recently at an Intel Software Conference in Barcelona about “thinking parallel”. I was saying, for instance, there is little point in high concurrency parallel programming producing reports faster—if the report sits on someone’s desk for a week, queuing up in a serial authorisation process with a single overloaded (human) processor.

Thinking parallel means thinking about concurrency in business processes too, something that isn’t often mentioned because (I think) many of the thinkers in the parallel processing space come from engineering and HPC (High Performance Computing) backgrounds where the problems are somewhat different to those in the mainstream business space. As are the architectures—general business computing is usually on a “shared memory” architecture so what one processor is doing can easily affect what others are doing; whereas HPC is often “shared nothing” (which, in some ways, is easier).

A real issue, now that performance improvements in mainstream computing will come from increasing (energy efficient) parallelism rather than ever-increasing clock speeds, is that many of the tools that developers need to help them cope with concurrency are built for HPC academics and performance specialists rather than mainstream business programmers. This makes some of the developer-friendly enhancements for parallelism in Microsoft’s Visual Studio 2010 welcome. However, I was a bit taken aback when Steve Teixeira (Product Unit Manager of Parallel Computing Platform Developer Division, Microsoft Corporation; where do they get these snappy job titles?) said that feedback from developers using VS 2010 was that help with the “correctness” of parallel programming was what they wanted next. I’d have thought that “correctness” (avoiding race-related result errors and deadlocks) was about the first thing a developer would need! Still, if there’s one thing Microsoft is good at, it’s making development tools that developers like, so I’m looking forward, as usual, to the next release.

James Reinders (Director of Marketing and Business for Intel Software Developer Products) did pull me up on a rather throwaway remark in my talk, something like “you can’t just rely on the compiler to look after concurrency”. “It’s true,” he says, “but it deserves a whole article, not just a sentence.” Well, yes, perhaps its full implications were a bit hard for the audience to take in after lunch, and James goes on to make a good point: you can’t rely on the compiler to choose your algorithm for you; you can’t rely on the compiler to discover parallelism, in general; but you can rely on a compiler to package parallelism you’ve identified, and no programmer should try to compete with a compiler in coding up the details of parallelism.

So, here’s what I mean in a bit more detail than I went into in the presentation (which you can find, with my explanatory notes, here). “Push button” programming is a dream, but it hasn’t been fully realised for serial programming, so promising it for parallel programming (as in “write serial logic, let the compiler take care of breaking it up into threads and tasks and scheduling it across parallel processors”) seems rather optimistic; and, of course, neither Intel nor Microsoft (just two examples) actually makes this promise. They promise to assist developers by providing high-level abstractions for parallel programming, not to do all the work for developers. Nevertheless, I remember once, many years ago, telling a developer that he needed to think about concurrency issues when developing a “groupware” application for the enterprise, and being told that “the compiler will look after it”. It’s an attractive idea for many developers, I suspect.

And James confirms this: “Oh, if I had a penny for every time a customer said ‘won’t the compiler just do it?’… well, I’d be rich,” he says, and goes on to say, “I’ve had to break this news over and over. Where do people get this high opinion of compiler magic? Compilers are amazing, but not that amazing… and they aren’t going to get there any time soon. The task to take a serial program and transform it to parallelism is generally an algorithm transformation. That won’t happen without the sort of intelligence humans still have and machines do not. In scientific code, loop-based mathematics can have parallelism that is ‘discovered’. This may feel common in scientific code, but it doesn’t apply to programs outside this domain… because scientific loops run a long time and process a lot of data… that is key.”

Think of the issues the compiler faces. For a start, it has to know how many processors are available at run time. The latest compilers and compiler extensions handle this pretty well, but you do need to be using the latest compiler technology, which can abstract “tasks” and handle allocating them to processors at run time; James claims that Intel’s “OpenMP and TBB [Intel’s supported abstractions for parallel processing, explained below] both do this with virtually no overhead”.

In theory, the compiler could perform a static analysis of the code and determine code blocks that are candidates for parallel running. However, I believe that it would have to be very conservative about using this information, as the safety (or not) of running code in parallel threads can only be fully determined by dynamic analysis at run time on the specific workload of concern.

So, what you can do is give the compiler “hints” and tell it what loops, say, can be parallelised and what processing threads can run in parallel. However, even then there are issues. Not least of these is whether the “hint” is correct: perhaps two pieces of code can safely run in parallel with the business process employed in London, say; but a slightly different business process employed in Shanghai, possibly, might mean that sometimes a race condition or a deadlock can develop. There’s a more complete description of races and deadlocks than I want to go into here, with code examples, on the Microsoft support site; it refers to multithreaded Visual Basic .NET but the issues are general.
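To make the idea of a “hint” concrete, here is a minimal sketch in Python (chosen only for brevity; the tools discussed in this article are C/C++ and .NET ones). The names `line_total` and `parallel_map` are hypothetical, invented for this illustration. By handing the loop to a thread pool, the programmer is asserting that every iteration is independent of the others; nothing in the library verifies that claim.

```python
from concurrent.futures import ThreadPoolExecutor


def line_total(order_line):
    """Pure function: uses only its argument and touches no shared state."""
    quantity, unit_price_pence = order_line
    return quantity * unit_price_pence


def parallel_map(func, items, workers=4):
    """Apply func to every item on a pool of worker threads.

    Calling this is the "hint": the caller promises that the calls to
    func do not depend on one another. The pool does not check the promise.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(func, items))


order_lines = [(2, 150), (1, 999), (10, 25)]
totals = parallel_map(line_total, order_lines)
```

If `line_total` were changed to update a shared running total, the same call would still run without complaint, but the hint would now be wrong and the result could depend on thread timing.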

I believe that there’s partly a computer science issue here: you can’t, in general, prove that a computer program (the compiler) can completely and correctly automate a solution to a problem (such as scheduling program threads across an arbitrary number of processors for an arbitrary workload). This is probably good, as it keeps developers in a job! James agrees: “Compilers are a collection of approximations to NP-hard (or worse) problems. Yes, compilers cannot, in general, be proven correct, but compilers using parallelism is not unique in that regard.”

However, that said, just think of the simple practical difficulties a compiler would face even with “good enough push button automation” of the production of multithreaded object code which will automatically schedule itself (or allow itself to be scheduled) across an arbitrary number of processors.

To start with, thread handling is an overhead, so (regardless of workload) the code mustn’t over-provision itself with threads (but what “over-provision” means rather depends on the number of processors and the system configuration found at run time).
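One common way of coping with this is to decide the number of worker threads at run time rather than hard-coding it. The Python sketch below is only a rule of thumb, under the stated assumption that one worker per processor suits CPU-bound work; `choose_worker_count` is a hypothetical helper written for this illustration, not part of any library.

```python
import os


def choose_worker_count(task_count, cpu_count=None):
    """Pick a worker count that does not exceed the processors or the tasks.

    Assumption (not from any library): one worker per processor is a
    reasonable ceiling for CPU-bound work. I/O-bound work may want more.
    """
    if cpu_count is None:
        # os.cpu_count() can return None when the count cannot be determined.
        cpu_count = os.cpu_count() or 1
    # Never more workers than tasks, never more than processors, at least one.
    return max(1, min(task_count, cpu_count))


workers_here = choose_worker_count(task_count=100)
```

The same program would therefore start four workers on a four-processor machine and sixteen on a sixteen-processor one, without being rebuilt.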

Then, it must avoid any possibility of deadlocks, which are (roughly) a consequence of having locks on resources, particularly locks held for a long time. However, you must have some locks as, otherwise, things may process in the wrong order. Even if this gives rise to a potential deadlock, this may not matter in practice if the locks are released quickly; but releasing locks quickly depends on the actual workload at run time (both on workload volumes, since a particular thread on an overloaded machine may run unexpectedly slowly, and on workload characteristics).
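A classic way for a deadlock to arise is for two threads each to hold one lock while waiting for the other’s. A widely used defence is to acquire locks in one agreed order everywhere. The Python sketch below illustrates that technique; `Account`, `transfer` and `run_opposing_transfers` are hypothetical names invented for the example, and it assumes that account ids are unique.

```python
import threading


class Account:
    def __init__(self, account_id, balance):
        self.account_id = account_id  # assumed unique per account
        self.balance = balance
        self.lock = threading.Lock()


def transfer(source, target, amount):
    """Move amount between two distinct accounts using a fixed lock order."""
    if source is target:
        return  # nothing to move, and taking the same Lock twice would hang
    # Always take the lock of the lower account_id first, whichever way the
    # money flows, so two threads can never each hold one lock and wait
    # for the other.
    first, second = sorted((source, target), key=lambda acct: acct.account_id)
    with first.lock, second.lock:
        source.balance -= amount
        target.balance += amount


def run_opposing_transfers(rounds=1000):
    """Run two threads that transfer in opposite directions at once."""
    a, b = Account(1, 1000), Account(2, 1000)

    def repeat(source, target):
        for _ in range(rounds):
            transfer(source, target, 1)

    threads = [
        threading.Thread(target=repeat, args=(a, b)),
        threading.Thread(target=repeat, args=(b, a)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return a.balance, b.balance
```

Had `transfer` simply locked `source` first and `target` second, the two opposing threads could each have taken one lock and waited indefinitely for the other, but only when the timing happened to line up, which is exactly why such faults are workload-dependent.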

And, of course, the code must always give correct results regardless of possible race conditions (a race condition arises, roughly, when two threads are updating the same object and whichever happens to run last overwrites the changes made by the other thread). This means that each thread must be processing independent units of work or, if not, that critical operations must be protected with locks (remembering that, in general, the more locks, the less concurrency you’ll achieve). Again, whether a potential race condition is met in practice depends on workload (on an overloaded machine, a thread which usually completes well before another thread wants to update a potentially shared object might not finish in time on occasion). Intel provides tools which will let a skilled developer recognise and address potential race conditions, but they do need user understanding and input.
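As a minimal sketch of protecting a critical operation with a lock, consider several threads incrementing one shared counter. This is an illustration in Python, not anyone’s production code; `SafeCounter` and `count_in_parallel` are hypothetical names made up for the example.

```python
import threading


class SafeCounter:
    """A counter whose read-modify-write step is protected by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could both read the same old value
        # and one update would overwrite the other: a race condition.
        with self._lock:
            self._value += 1

    def value(self):
        with self._lock:
            return self._value


def count_in_parallel(thread_count=4, increments_per_thread=10_000):
    """Have several threads increment one shared counter, then report it."""
    counter = SafeCounter()

    def work():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value()
```

The price of correctness is visible in the design: every increment queues on the one lock, so the threads spend much of their time waiting for each other. Giving each thread its own private count and adding the counts up at the end would be the “independent units of work” alternative.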

These issues with developing code for optimal concurrent processing are not new and, in the thirty or more years since IBM first produced multiprocessor mainframes for business computing, things haven’t changed much, except in degree. We expect many more processors to work in parallel these days on much less specialised systems (concurrency in business transaction processing systems using a mainframe database management system is comparatively straightforward). The tools which help us cope with concurrency have got a lot better and we now have coping strategies around using multithreaded utilities (database management systems, application servers, message queue managers etc.) which let us get some of the benefits of parallel processing without thinking about it too much. Nevertheless, the issue is arguably harder to deal with these days (more sophisticated platforms, more complex applications and developers used to serial coding), so we need better (and more developer-friendly) tools.

Still, there is hope. As James says: “I think most developers think the new models are about making the coding easier, or the parallelism clearer. Yes, they do that. But their real value is reducing (or eliminating) the sources of bugs, both in correctness and scaling. OpenMP, TBB, Cilk and Ct all help enormously with this. As do Microsoft’s TBB-like solutions of PPL and TPL. That’s the real revolution going on. One that asks programmers to code in a way that steers them away from disaster, and asks the compiler to help only in ways that it can.”

OpenMP (Open Multi-Processing; see Wikipedia) is a portable, scalable model for developing multi-threaded shared memory parallel applications in C/C++ and Fortran (Intel has one implementation; there are others).
TBB is Intel’s popular Threading Building Blocks library for parallel applications in C++. In Intel’s words, TBB “represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for scalability and performance”; it’s rather more than just a threads-replacement library.
Cilk is a general-purpose programming language based on GNU C, designed for multithreaded parallel computing in HPC applications. Cilk introduces a few keywords to direct parallel programming; if you take these out of a Cilk program, it compiles as a serial C program. Cilk Arts developed Cilk++ for more general commercial applications in C++ and this was acquired by Intel in 2009.
Ct (“C for Throughput”: only a working name, not what it will be called if it is released commercially) is a generalised data parallel programming solution under development by Intel that aims to free application developers from dependencies on particular low-level parallelism mechanisms or hardware architectures; you can download the beta here.
TPL is Microsoft’s “Task Parallel Library” in Visual Studio 2010, a domain-specific embedded language for expressing concurrency that lets programmers express potential parallelism in existing sequential code.
PPL is Microsoft’s Parallel Patterns Library in Visual Studio 2010, which provides support for task parallelism; generic parallel algorithms that act on collections of data in parallel; and generic parallel containers and objects that provide safe concurrent access to their elements.

“If you have been affected by any of the issues raised in this article, professional help and counselling is available.” Well, for example, starting May 12th, Intel’s parallelism evangelist, James Reinders, is interviewing Microsoft evangelists Steve Teixeira and Herb Sutter and they’ll be talking about concurrency issues; register to join in here. I haven’t seen this video in advance, but I always enjoy talking to James: in my experience, he talks about the real issues rather than just about marketing Intel.