I’ve been thinking a lot lately about microprocessors, from the many-core CPUs that AMD and Intel introduced recently to the massively scalable GPGPU processing that’s taking machine learning by storm. After years of consolidation on commodity x86 CPUs, it seems that the computing paradigm is turning again to specialized offload processors. This trend towards heterogeneous computing will change the face of hardware, from mobile devices to the datacenter.
The Central Processing Unit in the Main Frame
It wasn’t that long ago that computers were designed with all sorts of special processors. There was a central processing unit (CPU) in the “main frame,” but the heavy lifting was handled by an assortment of supporting processors in other frames. And early PCs typically had lots of special processors, from Steve Wozniak’s IWM floppy controller in the Apple IIc to the DSP in Atari’s revolutionary flop, the Falcon, to the floating point math co-processors (the 8087, 80287, and 80387) often purchased alongside 8086, ’286, and ’386 PCs.
But then Moore’s Law kicked in at Intel in the 1990s and the CPU began integrating more and more features. Soon, mainstream x86 CPUs boasted superior floating point performance and began adding special-purpose instructions for multimedia, encryption, and more.
Today, the mantra holds that software running on commodity processors will always win out over special-purpose hardware. But this simply isn’t true.
For one thing, Intel can only add so many special instructions to x86 before reaching a point of diminishing returns. The MMX instructions that sold Pentiums in the 1990s have been superseded by newer generations of AVX instructions, but they’re still there, taking up valuable silicon. And although parts of the industry would love to see SHA-256 and AVX-512 instructions, Intel seems unsure whether these are worth implementing in mainstream processors.
Revenge of the Co-Processors
Although computing today appears to be centralized on x86, this isn’t really true. From supercomputers to machine learning to game machines, non-CPUs are increasingly carrying the compute load. And this shift is even happening in mobile devices and PCs, especially over at Apple.
The counter-trend started with supercomputers and game machines back in the 2000s.
Sony, Toshiba, and IBM developed a heterogeneous CPU, the Cell Broadband Engine, which coupled a PowerPC core with multiple floating point coprocessor units (the Synergistic Processing Elements). Cell became well-known when it was selected as the processor for the PlayStation 3, but it was developed for use in high-performance computing (HPC) and in supercomputers. Sadly, some of the other architectural choices in the Cell processor limited its impact outside games.
Intel had similar projects in the works at the time: Larrabee (with many-threaded cores that could be used for graphics or specialized computing) and the Single-chip Cloud Computer (with 48 cores connected by a mesh network). Larrabee and Cell shared some interesting aspects: They could be used for graphics or for general-purpose math, and they could be integrated with a general-purpose processor or used in add-in cards.
As clusters of chips supporting AMD’s 64-bit x86 extensions began to dominate supercomputing, Intel embarked on an effort to develop a truly massive many-core x86 processor of its own. They evolved those research projects into Knights Ferry, a 32-core PCIe add-in card for scientific computing. The next generation of Intel’s “manycore” processor development gained a formal name: Xeon Phi.
Today, Xeon Phi is the key component of the fastest supercomputers on earth, with multiple petaFLOPS systems deployed, and it has led to x86 becoming by far the dominant supercomputer platform. But these aren’t just Xeon CPUs, despite the name. For Knights Corner, Intel went back to the history of x86 and dusted off the P54C core from the original Pentium, while today’s Knights Landing uses Airmont Atom cores. And Knights Landing is the first bootable Xeon Phi, though most designs incorporate conventional Xeon CPUs as well.
The Rise of GPGPU
Nvidia took inspiration from the world of supercomputers to create a “general-purpose computing” API for its GPUs, CUDA. The rest of the industry followed with a standard GPGPU API, OpenCL, which lets mainstream computers leverage the shaders and vector processing units inside GPUs from Nvidia, AMD, Intel, PowerVR, and others to offload simple computing tasks in parallel.
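To make the offload model concrete, here is a minimal sketch of how a CUDA program hands work to the GPU. The kernel and buffer names (vectorAdd, ha, da, and so on) are my own illustration, not taken from any vendor’s sample code. The pattern is simple: copy data to the GPU, launch a kernel that runs one lightweight thread per element, and copy the results back.

```
// Illustrative only: a minimal CUDA "vector add" showing the offload pattern.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each GPU thread computes one output element; thousands run in parallel.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                    // one million elements
    const size_t bytes = n * sizeof(float);

    // Host (CPU) buffers
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device (GPU) buffers
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);

    // 1. Copy inputs to the GPU
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // 2. Launch enough 256-thread blocks to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(da, db, dc, n);

    // 3. Copy the result back to the CPU
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);             // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

OpenCL expresses the same copy-launch-copy pattern, just with more host-side boilerplate.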
The massive computing power of higher-end GPU cards cannot be denied: With thousands of compute units available, no CPU can keep up. From machine learning to cryptocurrency mining, GPUs aren’t just the best choice, they’re the only thing that makes sense.
Although their shaders and FPUs have been opened up somewhat, GPUs still aren’t designed for general-purpose computing. For example, only Nvidia has implemented IEEE-compliant floating point math and double-precision numbers. So the next step is GPGPU hardware designed with more than graphics in mind. Intel is tackling this with Knights Mill, their next Xeon Phi, which is specifically designed for machine learning, and it is widely believed that Nvidia is working on dedicated machine learning processors as well.
Stephen’s Stance
It may appear that today’s computers are centered on an all-powerful x86 or ARM CPU, but this is rapidly changing. The next designs are likely to include massively-powerful co-processors, from GPUs to special-purpose chips, that radically outperform the CPU through simplification and parallelization.
Although GPGPU is limited to a few applications in Windows, Apple has made widespread use of OpenCL in both macOS and iOS. But it’s the world of iDevices where Apple’s control of both hardware and software really shines: Every device running current revisions of iOS has a base level of GPGPU power. And this leads me to a rather interesting conclusion, which I will discuss in my next article.
UMASREE USA says
Thanks a lot for introducing this news to us. Very interesting; I should learn more about it.
Sean says
I think another key reason for the prevalence of heterogeneous computing is that frequency scaling died out almost a decade and a half ago and we also hit a power wall (I guess those two are closely related). Because of those two issues, we’ve naturally had to go multi-core and adapt workloads to hardware that is more efficient for those tasks; conversely, hardware had to be tweaked to run the general-purpose tasks it could be good at – an example of this is when GPUs were made general-purpose.