Co-Processors, GPGPU, and Heterogeneous Computing

I’ve been thinking a lot lately about microprocessors, from the many-core CPUs that AMD and Intel introduced recently to the massively scalable GPGPU processing that’s taking machine learning by storm. After years of consolidation on commodity x86 CPUs, it seems that the computing paradigm is turning again to specialized offload processors. This trend towards heterogeneous computing will change the face of hardware, from mobile devices to the datacenter.

The rise of x86 seemed to wipe away the need for special-purpose processors, but that pendulum is swinging back.

The Central Processing Unit in the Main Frame

It wasn’t that long ago that computers were designed with all sorts of special processors. There was a central processing unit (CPU) in the “main frame” but the heavy lifting was handled by an assortment of supporting processors in other frames. And early PCs typically had lots of special processors, from Steve Wozniak’s IWM in the Apple II to the DSP in Atari’s revolutionary flop, Falcon, to the floating point math co-processors often purchased with 8086, ‘286, and ‘386 PCs.

But then Moore’s Law kicked in at Intel in the 1990s and the CPU began integrating more and more features. Soon, mainstream x86 CPUs boasted superior floating point performance and began adding special-purpose instructions for multimedia, encryption, and more.

Today, the mantra holds that software running on commodity processors will always win out over special-purpose hardware. But this simply isn’t true.

For one thing, Intel can only add so many special instructions to x86 before it reaches a point of diminishing returns. The MMX instructions that sold Pentiums in the 1990s have been superseded by new generations of AVX instructions today, but they’re still there, taking up valuable silicon. And although parts of the industry would love to see SHA-256 and AVX-512, Intel seems unsure if these are worth implementing in mainstream processors.

Revenge of the Co-Processors

Although computing today appears to be centralized on x86, this isn’t really true. From supercomputers to machine learning to game machines, non-CPUs are increasingly carrying the compute load. And this shift is even happening in mobile devices and PCs, especially over at Apple.

The counter-trend started with supercomputers and game machines back in the 2000s.

Sony, Toshiba, and IBM developed a heterogeneous CPU, the Cell Broadband Engine, which coupled a PowerPC core and multiple floating point coprocessor units. Cell became well-known when it was selected as the processor for the PlayStation 3, but it was developed for use in high-performance computing (HPC) and in supercomputers. Sadly, some of the other architectural choices in the Cell processor limited its impact outside games.

Intel had similar projects in the works at the time, Larrabee (with many-threaded cores that could be used for graphics or specialized computing) and the Single-chip Cloud Computer (with 48 cores connected with a mesh network). Larrabee and Cell shared some interesting aspects: They could be used for graphics or for general-purpose math, and they could be integrated with a general-purpose processor or used in add-in cards.

As clusters of chips supporting AMD’s 64-bit x86 extensions began to dominate supercomputing, Intel embarked on an effort to develop a truly massive multi-CPU cluster. They evolved those research projects into Knights Ferry, a 32-core PCIe add-in card for scientific computing. The next generation of Intel’s “manycore” processor development gained a formal name: Xeon Phi.

Today, Xeon Phi is the key component of the fastest supercomputers on earth, with multiple petaFLOPS projects deployed, and it has led to x86 becoming by far the dominant supercomputer platform. But these aren’t just Xeon CPUs, despite the name. For Knights Corner, Intel went back to the history of x86 and dusted off the P54C core from the original Pentium, while today’s Knights Landing uses Airmont Atom cores. And Knights Landing is the first bootable Xeon Phi, though most designs incorporate conventional Xeon CPUs as well.

The Rise of GPGPU

Nvidia took inspiration from the world of supercomputers to create a “general-purpose computing” API, CUDA. The rest of the industry followed with a standard GPGPU API, OpenCL, that allowed mainstream computers to leverage the computing power of the shaders and vector processing units inside GPUs from Nvidia, AMD, Intel, PowerVR, and more to offload simple computing tasks in parallel.

The massive computing power of higher-end GPU cards cannot be denied: With thousands of compute units available, no CPU can keep up. From machine learning to cryptocurrency mining, GPUs aren’t just the best choice, they’re the only thing that makes sense.

Although the shaders and FPUs in them have been opened up somewhat, GPUs still aren’t designed for general-purpose computing. For example, only Nvidia has implemented IEEE-compliant floating point math and double-precision numbers. So the next step is GPGPU hardware designed with more than graphics in mind. Intel is tackling this with Knights Mill, their next Xeon Phi, which is specifically designed for machine learning, and it is widely believed that Nvidia is working on specific ML processors as well.

Stephen’s Stance

It may appear that today’s computers are centered on an all-powerful x86 or ARM CPU, but this is rapidly changing. The next designs are likely to include massively powerful co-processors, from GPUs to special-purpose chips, that radically outperform the CPU through simplification and parallelization.

Although GPGPU is limited to a few applications in Windows, Apple has made widespread use of OpenCL in both macOS and iOS. But it’s the world of iDevices where Apple’s control of both hardware and software really shines: Every device running current revisions of iOS has a base level of GPGPU power. And this leads me to a rather interesting conclusion, which I will discuss in my next article.

Comments

UMASREE USA says

July 6, 2017 at 6:51 pm

Thanks a lot for introducing us to this news. Very interesting; I should learn more about it.

Sean says

January 21, 2018 at 4:06 am

I think another key reason for the prevalence of heterogeneous computing is that frequency scaling died out almost a decade and a half ago, and we also hit a power wall (I guess those two are closely related). Because of those two issues, we’ve naturally had to go multi-core and adapt workloads to use hardware that is more efficient for those tasks. Conversely, hardware had to be tweaked so it could run the general-purpose tasks it is good at; one example is when GPUs were made general purpose.