Intel even more committed to parallel processing

The 10th Intel European software conference – ISTEP 2015 – was in Sevilla this year. It is now several years since Intel announced the end of the “free lunch”, so perhaps this is a good time to look at the idea again. Crudely put, developers once got a free lunch even if they wrote bad, inefficient, serial code, because the next Intel processor was so much faster at serial processing that it just didn’t matter.

Unfortunately, perhaps, the free lunch didn’t end so much as get downgraded to free sandwiches, and some people are still coding systems that can’t exploit parallel processing on multiple processors very well. True, ever faster serial processors are being replaced with large numbers of slower processors working in parallel, but the “slower” processors are still quite fast and compilers (Intel’s especially) are getting smarter at making code run concurrently on several processors – in the subset of situations where they can do this safely. The trouble is that, although it is quite easy to run, say, printing operations concurrently, it is also quite easy to write code in such a way that it is forced to queue up for a single processor, so that only a small subset of the available processors actually gets much use in practice (and this may not be obvious to anyone reading the source code). Such problems may not show up when you first install the code but become apparent as things scale up over time.
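
To make the “queueing up” problem concrete, here is a minimal C++11 sketch of my own (it is not taken from any Intel material, and the function names are purely illustrative). Both functions start the same number of threads and return the same answer, but the first makes every thread take one shared lock for every single addition, so the threads largely wait for each other; the second lets each thread work on a private total and combine the results once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example: every addition takes the same lock, so the threads
// mostly queue behind one another however many cores are available.
std::uint64_t sum_with_shared_lock(const std::vector<std::uint32_t>& data,
                                   unsigned num_threads) {
    assert(num_threads > 0);
    std::uint64_t total = 0;
    std::mutex total_mutex;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Thread t handles elements t, t + num_threads, t + 2*num_threads, ...
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                std::lock_guard<std::mutex> guard(total_mutex);
                total += data[i];
            }
        });
    }
    for (auto& w : workers) w.join();
    return total;
}

// Hypothetical example: each thread sums into a private variable and writes
// one result to its own slot, so no locking is needed inside the loop.
std::uint64_t sum_with_private_totals(const std::vector<std::uint32_t>& data,
                                      unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            std::uint64_t local = 0;
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                local += data[i];
            }
            partial[t] = local;  // one write per thread, to a distinct element
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}
```

Nothing in the first function looks obviously wrong on a casual read, which is rather the point: how much the lock costs depends on the hardware and the workload, and it tends to show only when you measure the code on a machine with many cores.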

There are increasing opportunities for parallel – concurrent – processing today; not only with many cores available on a chip but also with dedicated graphics co-processors (GPUs) that can be used for general computation tasks. This not only offers the opportunity for sheer performance improvements but also (possibly even more useful for many of us) an improved user experience, as users can continue to work with a responsive system using some processors, while heavyweight tasks are running on other processors. Intel Xeon Phi co-processors take an interesting approach, which gives programmers more freedom of choice – rather than requiring a different programming model to run computations on a co-processor really designed for graphics processing, they offer the same programming model as the main CPU, but optimised for parallel processing performance.

James Reinders is Intel’s leading spokesperson on tools for concurrent processing, a noted author, and its parallel programming evangelist. He spent quite some time in his keynote pointing out the differences between using a few very powerful processors and using very many, very much less powerful and very restrictive processors that offer a lot more computational power in total, if your algorithms support concurrent processing (Reinders won’t comment on other firms’ products but I’d think that an NVIDIA graphics co-processor card using the CUDA programming model would be a typical example). He then contrasted this with Intel’s preferred co-processor approach – Intel Xeon Phi – with many less powerful processors optimised for efficient concurrent processing, but using the same programming models, languages, optimisations and tools as the main CPU. There’s no need, Reinders says, for a dual programming architecture; I think that this could help to reduce the barriers to the wider use of effective concurrent processing designs.

High performance computing (HPC) is becoming less of a specialised interest, as general business applications increasingly want to run ever more sophisticated algorithms against ever larger datasets – which is only going to scale if parallel (concurrent) processing can be exploited efficiently. Reinders points out that all the latest standards – Fortran 2008 (approved in 2010!), C++11, C11 – have support for concurrent processing. OpenMP, he says, is widely used for HPC now; TBB (Threading Building Blocks) is mostly used with C++ and for more general computing, as well as HPC; and Cilk Plus is more visionary: “helping us participate in influencing future C, C++ and OpenMP standards”, he says.
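
As an illustration of what that standard-library support looks like in practice, here is a small sketch using only C++11 facilities (std::async and std::future). It is my own example rather than anything shown at the conference; the function name parallel_sum and the way the work is chunked are assumptions made for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: split a vector into roughly equal chunks, sum each
// chunk in its own asynchronous task, then add up the partial results.
long long parallel_sum(const std::vector<int>& values, std::size_t num_tasks) {
    assert(num_tasks > 0);
    // Round up so that every element falls into some chunk.
    const std::size_t chunk = (values.size() + num_tasks - 1) / num_tasks;
    std::vector<std::future<long long>> parts;
    for (std::size_t begin = 0; begin < values.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, values.size());
        // std::launch::async asks for the task to run on a separate thread.
        parts.push_back(std::async(std::launch::async, [&values, begin, end] {
            return std::accumulate(
                values.begin() + static_cast<std::ptrdiff_t>(begin),
                values.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
        }));
    }
    long long total = 0;
    for (auto& part : parts) total += part.get();  // waits for each task
    return total;
}
```

Calling parallel_sum(values, 4) would split the work into at most four tasks. Libraries such as TBB and OpenMP exist partly because they handle scheduling and load balancing far better than a hand-rolled loop like this one.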

Nevertheless, I do worry about how fast the industry in general is at taking on board what is now possible. It troubles me, for example, that Adobe has only just started to make use of dedicated graphics cards in its Lightroom photographers’ package with its recent release of Lightroom 6. By all accounts, the performance improvements are welcome, but why only now? Most people who hadn’t looked at it in detail probably assumed that Lightroom had been exploiting graphics cards for years; I wonder how many other programs don’t fully exploit modern architectures and have got away with it because the “free lunch” was only withdrawn gradually?

Most good developers now realise that they must exploit parallel processing in order to remain competitive if performance is important – and when isn’t it? Luckily, since coding for parallel systems is hard, Intel (for instance) produces some excellent tools that help you write parallelisable code (you can tell the compiler that certain operations are safe to run in parallel); that check the safety of parallel code (for possible race conditions, deadlocks and so on); and that help developers visualise how their code is running across several CPUs. Nevertheless, some programmers are too proud of their expertise to use such things as code visualisation tools. Some even complain that Intel’s tools don’t work, when they do try them, because the tool finds problems in their code! Unfortunately, investigation then shows that the tools do work – this isn’t a technical issue so much as a cultural one. Development shops need to inculcate a culture where developers take more pride in showing that their code shows up well in the latest execution visualisation and QA tools than in their ability to code without assistance.
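
One common way of telling the compiler that a loop is safe to run in parallel is an OpenMP directive. The sketch below is my own illustration, not Intel sample code, and the function name dot_product is just an example. It assumes a compiler with OpenMP support switched on (typically -fopenmp with GCC or Clang, -qopenmp with Intel’s compilers); without that option the pragma is ignored and the loop simply runs serially, giving the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: the programmer asserts that the loop iterations are
// independent and that "sum" may be combined across threads by addition.
double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    const long long n = static_cast<long long>(a.size());
    // Each thread gets a private copy of sum; OpenMP adds the copies together
    // when the loop finishes.
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < n; ++i) {
        sum += a[static_cast<std::size_t>(i)] * b[static_cast<std::size_t>(i)];
    }
    return sum;
}
```

The directive is a promise from the programmer, not something the compiler proves; if the iterations were not really independent, the code would still compile, which is exactly where checking tools earn their keep. Floating-point results may also differ very slightly between serial and parallel runs, because the additions happen in a different order.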

The writing is on the wall – the latest Intel Xeon Phi co-processor doesn’t make allowances for serial code (it will run, but slowly compared to parallel code that can run on several processors). I don’t think that we can afford to let up on highlighting the need for training in parallelisation and associated techniques just because multi-CPU systems are now ubiquitous. There is some excellent reading available: Reinders’ “High Performance Parallelism Pearls” looks like an interesting read for high-performance computing specialists and perhaps it will increase understanding of concurrent processing for more general programmers too.

Developers who write high-performance code but who don’t deeply understand parallel processing (and may, in the past, have got away with relying on the excellent optimising compilers available today) need to get their act together. And it’s not just about performance – badly parallelised programs can generate interesting bugs that only appear at scale, that can be hard to reproduce in small test systems, and that will impact trust in a computing platform. If tools are available on your platform to help you visualise and quality-assure parallel code you should use them (or make sure your developers do; perhaps managers should take an interest, at the visualisation level, in how effectively any code they are ultimately responsible for exploits concurrent processing – a governance issue). And if your development platform doesn’t provide assistance with writing code that runs well, in parallel, on many processors, perhaps you should extend it with extra tools – or find a platform that does.
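
A classic example of a bug that hides in small tests is the shared counter. The sketch below (again my own, with a hypothetical function name) shows the safe version using std::atomic from C++11. If counter were a plain long instead, the concurrent increments would be a data race: updates could be lost, and the loss would typically show up only occasionally, and more often as the number of threads and the load increase.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical example: several threads record events in one shared counter.
long count_events(unsigned num_threads, long events_per_thread) {
    assert(num_threads > 0 && events_per_thread >= 0);
    // std::atomic makes each increment indivisible. A plain "long" here would
    // be a data race (undefined behaviour) that may pass small tests.
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, events_per_thread] {
            for (long i = 0; i < events_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();  // join() makes all increments visible
    return counter.load();
}
```

With the atomic counter the result is always num_threads multiplied by events_per_thread. Race-detection tools of the kind mentioned above are designed to flag the unsafe variant even on runs where it happens to produce the right number.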