The Intel Software Conference 2013 in Chantilly

The Intel Software Conference was in Chantilly this year and, once again, James Reinders’ keynote set the agenda – see his blogs here. It’s all about parallel programming from few to many cores with consistent models, languages, tools, and techniques; and with special emphasis on the Xeon Phi Coprocessor.

The Intel story includes:

Better tools for parallel programming
Better parallel models
Wildly more hardware parallelism
Better educated programmers

Reinders highlighted tools such as Intel Advisor XE, to help you design for parallelism and predict the scalability you might achieve; Intel Composer XE, to help you build the code; Intel Inspector XE, to help you validate the results; and Intel VTune Amplifier XE, to help you profile and tune performance. These are all pretty techie, but the UI (especially for VTune) is much more supportive and accessible than it used to be.

As for new models for parallel programming, Intel now has a couple of parallel programming models, which “yield portability, performance, productivity, usability, maintainability”: TBB (Threading Building Blocks), which is popular for C++ scaling; and Cilk Plus, which helps C programmers and supports vectorisation.

Reinders is particularly proud of Intel’s SIMD directives, claimed as an Intel innovation and now (well, mid 2013) part of the OpenMP 4.0 standard. This is about whether the compiler will vectorise your code for (much) better performance. Left to itself, the compiler is pretty conservative (which is safe but can be a nuisance) and vectorises a loop only when it can prove that doing so is safe; SIMD directives tell it to vectorise regardless. That is great if you know what you are doing, but if you get it wrong (by vectorising a loop whose iterations depend on each other, say), you have some debugging to do. Frankly, this scares me, because whenever the general business programmers I knew got near anything clever like this, we got bugs in production – production outages – but, to be fair, SIMD directives are probably aimed at a different class of high-performance computing programmer. It’s a very useful feature, I think, especially now it is part of the OpenMP standard; but I’d use it with caution.
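To make that concrete, here is a minimal sketch of an OpenMP 4.0 SIMD directive in C++. The function name saxpy and the test values are my own illustration, not something from the conference. The loop is safe to vectorise because each iteration touches only its own element; put the same pragma on a loop where one iteration reads what the previous one wrote and you would be asking for exactly the kind of bug I worry about. If OpenMP is not enabled, the compiler ignores the pragma and the loop simply runs as ordinary scalar code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y[i] = a * x[i] + y[i] for every element.
// Illustrative only: the name and signature are assumptions, not an Intel API.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());  // both vectors must be the same length

    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = x.size();

    // Ask the compiler to vectorise this loop without proving safety itself.
    // It is safe here because iteration i reads and writes only element i.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Calling saxpy(2.0f, x, y) with x = {1, 2, 3, 4} and y = {10, 20, 30, 40} leaves y holding {12, 24, 36, 48}, whether or not the loop was actually vectorised.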

Wildly more parallel processing, in Intel terms, is currently about Xeon Phi coprocessors, it seems. Offloading processing to a GPU (graphics processing unit) isn’t new, of course, but GPUs have specialised architectures and command sets, and programming them is a bit of a black art. The Xeon Phi is simply a version of the Xeon chip optimised for parallel processing (Reinders suggests that “XEON Phi is what XEON will look like in 5-10 years time”) and it offers opportunities for interesting highly-parallel and flexibly-parallel heterogeneous devices, because it has the same instruction set as the Xeon (which should make programming it easier and reduce complexity). Reinders talks about “supercomputer on a chip” (see performance figures here) and claims that “general purpose IA [Intel Architecture] hardware leads to less idle time for your investment”; it’s possible to move code seamlessly to Xeon Phi and back again, for example. Personally, I like this model a lot, although some programmers I know are less keen (perhaps because GPU specialisation offers greater performance for the subset of general programming the GPU is suitable for, and perhaps because some programmers enjoy programming challenges and complexity). Intel’s vision is to “span from few cores to many cores with consistent models, tools, languages and techniques”, which reads like a pretty good vision; but I guess we’ll have to see how successful the whole family from Atom through to Xeon Phi is in practice.
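As a rough illustration of what moving code to a coprocessor and back can look like, here is a small C++ sketch using the target construct from OpenMP 4.0. This is my own stand-in, not Intel’s offload syntax and not code shown at the conference; the function name sum_of_squares is hypothetical. The point is that the loop body is ordinary C++: with an offload-capable compiler and device the marked region may run on the coprocessor, and without them the same source just runs on the host.

```cpp
#include <cassert>

// Sums the squares of n values. Hypothetical example function.
double sum_of_squares(const double* x, int n) {
    assert(x != nullptr || n == 0);  // a null pointer is only valid for n == 0

    double total = 0.0;

    // Request that this region run on an attached device if one is available.
    // map(to: ...) copies the input across; map(tofrom: ...) brings the sum back.
    // Without OpenMP offload support the pragmas are ignored or fall back to the host.
    #pragma omp target map(to: x[0:n]) map(tofrom: total)
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += x[i] * x[i];
    }
    return total;
}
```

For the four values 1, 2, 3 and 4 the function returns 30 wherever the loop ends up running, which is rather the point of a consistent programming model.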

Educating programmers in the Intel MIC (Many Integrated Core) architecture and how to exploit it is the fourth part of the story, and Intel has a good range of books to help. Check out Intel Xeon Phi Coprocessor High Performance Programming by James Reinders and Jim Jeffers at www.lotsofcores.com, for example.

And that’s probably as techie as I want to get in this blog – the conference went a lot deeper into all this, of course. I’d just like to finish with an Intel customer, Incredibuild. It started out with build automation and optimisation, for efficient deployment of developed code into production. However, the techniques and models used to make sure that software build tasks can overlap efficiently (that is, run in parallel for development acceleration) turn out to be applicable to many other workloads (that is, for general application acceleration). Parallelism is good, but it’s not just at the core level. Never overlook parallelism at the macro level, and learn from mainframe job schedulers, which have been maximising throughput for decades by running business-level ‘jobs’ in parallel across linked mainframes.
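A toy model shows why overlapping jobs matters. The C++ sketch below is entirely my own illustration (it is not how Incredibuild or any mainframe scheduler works internally): each job has a duration and a list of jobs it must wait for, and we compare running everything one after another with running every job as soon as its prerequisites finish. It assumes unlimited workers and that jobs are listed after the jobs they depend on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One unit of work, such as a compile step or a batch job. Hypothetical type.
struct Job {
    int minutes;                    // how long the job takes to run
    std::vector<std::size_t> deps;  // indices of jobs that must finish first
};

// Elapsed time if the jobs run strictly one after another.
int serial_minutes(const std::vector<Job>& jobs) {
    int total = 0;
    for (const Job& job : jobs) {
        total += job.minutes;
    }
    return total;
}

// Elapsed time if every job starts the moment its prerequisites are done,
// assuming there are always enough workers (the critical path length).
int parallel_minutes(const std::vector<Job>& jobs) {
    std::vector<int> finish(jobs.size(), 0);
    int elapsed = 0;
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        int start = 0;
        for (std::size_t d : jobs[i].deps) {
            assert(d < i);  // prerequisites must appear earlier in the list
            start = std::max(start, finish[d]);
        }
        finish[i] = start + jobs[i].minutes;
        elapsed = std::max(elapsed, finish[i]);
    }
    return elapsed;
}
```

Take three independent compile steps of 4, 6 and 5 minutes, a 2-minute link that needs all three, and a 3-minute test run that needs the link. Run serially that is 20 minutes; overlapped it is 11, because only the longest compile step sits on the critical path.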