Multicore – another IT Crisis

Intel seems to think of itself as a hardware company but it is also a serious player in the software arena, with extremely sophisticated compilers, including what is perhaps the most broadly adopted (and highly regarded) FORTRAN compiler, still much used for computationally-intensive high performance computing.

Even more interesting is that it supplies excellent tools for debugging parallel programs running on Intel’s multicore computers (that is, on Intel’s x86 architecture computers using chips with several “cores” or CPUs internally). Soon, every computer you can buy will be multicore (for reasons associated with controlling heat production) and only programs which can run on many cores at once will take full advantage of them. In fact, since individual cores on a multicore computer generally run at a slower clock speed than the single cores on today’s computers (again, to control heat production), applications that can only run on one core at a time will slow down as you buy newer machines (or, since newer internal designs compensate for the lower clock speed, not run much faster). Has anyone ever wished to buy a new computer and not see things markedly speed up?

Unfortunately, as well as promising increased throughput, running in parallel on several cores brings the possibility of several interesting “new” programming bugs (to be accurate, they aren’t really new, as they are possible, but less likely, in existing systems). These are essentially “non-deterministic”: that is, they only appear sometimes, typically when the program is stressed in production, and may not show up in testing on small amounts of data.

The first of these is the race condition—the programmer assumes that operations run in a certain order, which they do, when running on one core doing one thing at a time. However, with several cores sharing the work, which one finishes first is often a matter of chance and things may run in the wrong order.
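To make the race condition concrete, here is a minimal C++ sketch (not taken from Intel's material). The function name `count_with_lock` and the shared-counter scenario are illustrative assumptions. Several threads increment one shared counter, and a mutex ensures that each read-modify-write completes before another thread can start its own.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads each add 1 to a shared counter
// `increments` times and the final count is returned.
long count_with_lock(int num_threads, int increments) {
    assert(num_threads > 0 && increments >= 0);
    long counter = 0;            // shared by every worker thread
    std::mutex counter_mutex;    // guards `counter`
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                // Without this guard, `++counter` on two threads at once is a
                // data race: both may read the same old value and one update
                // is lost, but only sometimes, depending on timing.
                std::lock_guard<std::mutex> guard(counter_mutex);
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();  // wait for all workers to finish
    return counter;
}
```

Calling `count_with_lock(4, 10000)` should return 40000 on every run. If the `lock_guard` line is removed, the result becomes unpredictable, which is exactly the "only sometimes" behaviour that makes such bugs hard to catch in testing.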

The second new bug is the deadlock. This occurs when a process running on one core holds a resource while waiting for a second resource to be released and, at the same time, the process holding the second resource is waiting for the first process to release the resource it is holding. A deadlock brings both processes to a stop and the program they are part of never finishes.
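The deadlock pattern described above can be sketched in C++ with two threads that each need the same two locks. The `Account` type and the function names below are hypothetical, chosen only for illustration. The sketch shows one common way to avoid the problem: acquire both locks together with `std::lock`, which is specified to avoid deadlock, instead of taking them one at a time in opposite orders.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical resource: an account protected by its own mutex.
struct Account {
    std::mutex m;
    long balance;
    explicit Account(long b) : balance(b) {}
};

// If this function locked from.m and then to.m separately, a thread running
// transfer(a, b) could hold a.m while a thread running transfer(b, a) holds
// b.m, each waiting for the other forever. std::lock takes both at once.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);  // release on exit
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads move money in opposite directions; returns the combined balance.
long run_opposing_transfers(int rounds) {
    Account a(1000), b(1000);
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

`run_opposing_transfers(10000)` should finish and return 2000. A version that takes the two locks one after the other may also finish in most test runs, which is why a tool that detects the potential deadlock is more reliable than waiting for it to happen.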

Intel’s tools that help address these problems on Intel technology (you’ll need to find equivalent tools on other platforms) are Intel Threading Building Blocks, which helps you write parallel code that works properly; VTune Performance Analyser and Thread Profiler, which detect potential or actual performance problems; and Intel Thread Checker, which detects deadlocks and race conditions.

However, this isn’t just a programming technology issue. Management has to encourage programmers who are coding for x86 multicore architectures, using Intel compilers, to use these tools as routine. If they are coding for different parallel programming technologies, they should be using something equivalent—parallel programming is non-intuitive and you need tools to help you understand what is going on.

For example, Levent Akyil (Software Engineer, Performance, Analysis, and Threading Lab, Intel Software and Solutions Group), who was presenting at Intel’s recent Software Conference in Prague, demonstrated the case of a simple program loop calculating the value of pi. It works on a single core, at a certain speed (which you have to remember to benchmark; otherwise, how can you tell if your multicore implementation is efficient?). However, once threads have been introduced and the parallel version is moved to a multicore machine (remember that “single-threaded” applications will only ever run on one core at a time, no matter how many are available), it might generate wrong results, and as this only happens “sometimes”, you may not always notice, which is worrying. This is because both cores are trying to process the same variables, a potential race condition, and sometimes something gets overwritten before it can be used: a data race in action.

This is easily fixed by putting a “lock” on the variables you are changing until they have been processed. That’s fine, but the calculation now crawls on two processors, running orders of magnitude more slowly than on a single core, because, it turns out, you put the lock inside the loop (and are therefore constantly requesting and releasing it, which are “expensive” operations). So you move the lock outside the loop, and it still runs slowly because of contention within the chip: each iteration invalidates cached data needed by the other processor, which then has to be reread, slowly, out of main memory. Fix that, and the calculation runs about twice as fast on two cores as it does on one. However, to recognise and fix these problems, you need tools that help you to see what is really going on.
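The end state of that tuning exercise might look something like the C++ sketch below. The article does not say which formula Akyil’s demonstration used, so this assumes the common approach of integrating 4/(1+x²) over [0, 1] with the midpoint rule. The name `parallel_pi` is hypothetical. Each thread accumulates into its own local variable, so there is no lock inside the loop and no shared memory being written on every iteration; the lock is taken once per thread, at the end.

```cpp
#include <cassert>
#include <cmath>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: approximate pi by midpoint-rule integration of
// 4 / (1 + x^2) on [0, 1], splitting the steps across num_threads threads.
double parallel_pi(long steps, int num_threads) {
    assert(steps > 0 && num_threads > 0);
    const double width = 1.0 / static_cast<double>(steps);
    double total = 0.0;        // shared result, written once per thread
    std::mutex total_mutex;    // guards `total`
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Thread-private accumulator: the hot loop needs no lock and
            // does not write to memory that other cores are also updating.
            double local = 0.0;
            for (long i = t; i < steps; i += num_threads) {
                const double x = (i + 0.5) * width;
                local += 4.0 / (1.0 + x * x);
            }
            // One lock acquisition per thread, not one per iteration.
            std::lock_guard<std::mutex> guard(total_mutex);
            total += local;
        });
    }
    for (auto& w : workers) w.join();
    return total * width;
}
```

For instance, `parallel_pi(1000000, 2)` should agree with pi to many decimal places. Whether it actually runs twice as fast as the single-threaded loop on a given machine is something to measure, not assume, which is the point of benchmarking the single-core version first.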

Now, the point Akyil was making here is that parallel programming is hard and sometimes non-intuitive, but that Intel’s tools (VTune etc.) are really good at letting you visualise what is going on in this sort of situation, if you remember to use them. Akyil cited one company which had what it thought was a parallel program, which ran serially much of the time without anyone noticing until they ran VTune Analyser and Thread Profiler. It is easy for a programmer to assume s/he understands the issues when s/he doesn’t. This implies a need for training, and for management recognition of people doing the job properly rather than taking shortcuts; these are (largely) management, people and process issues.

[Photo: James Reinders. © David Norfolk]

James Reinders (author of “Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism”, and Director of Marketing and Business at Intel) gave the keynote at the Prague conference and himself raised some interesting issues with multicore and “thinking parallel”, which Intel definitely sees as the future of computing:

How can managers distinguish the rather cosmetic messages about parallelism coming from some companies from (what he claims is) Intel’s much more solid offering? “Thinking parallel” is non-intuitive for many programmers, let alone their managers, and the devil is in the detail. The answer is to get independent expert advice on this issue now, before a (possibly dysfunctional) solution is forced onto you by circumstances.

FUD (Fear, Uncertainty and Doubt) is a real issue, according to Reinders. Some vendors may use the multicore crisis to panic you into buying technology which doesn’t really meet your needs. He cited compiler companies claiming to match Intel’s math library capabilities, but without all the parallelism support in Intel’s libraries, and the promotion of GPUs (graphics processing units) on video cards as general-purpose parallel processors you can offload business work to, although these have real ease-of-programming and maintainability issues because GPU programming is currently so hardware-specific.

However, we should put the multicore crisis, which is a real one (especially for high performance desktop systems), in perspective. As Reinders allowed, some platforms (the mainframe zSeries and big Sun servers, for example) have been doing parallel processing for years. The Java application servers commonly used in business are inherently capable of processing their workloads in small parallel chunks.

There is parallel expertise out there, if not necessarily where you think you want it. In fact, Ray Jones (Worldwide VP, z Software at IBM, speaking at an Analyst seminar in Marlow, UK recently) seems to be pinning his zSeries (System z) mainframe software strategy on allowing his customers to deal with the multicore crisis. As well as handling transactions and the application server, both well-suited to parallelisation, the zSeries has robust virtualisation and prioritisation services, and the z10 can emulate a fast (about 4GHz) single core processor for your legacy single-threaded applications (running in parallel with other processors), if you need one while migrating to the ubiquitous multicore platforms.

We think that there are many ways to address the coming multicore crisis, only some of which require you to rewrite your applications for massively parallel programming (check out Pervasive’s solution for certain kinds of database programming, for example). However, our message to corporate managers is that they should be starting to make “fact-based choices” among the competing strategies for multicore now.

And a good start might be to involve developers in the debate, perhaps to buy them some books and point them at whatif.intel.com where Intel makes some of its experimental research-oriented software for the new multicore world available for them to play with (its finished products are here). The presentations from Intel’s Software Conference are also available on the site.