Grok-ability and the multicore processor
April 01, 2010
Mixing multicore complexity with disparate tools and architectures is like holding a United Nations meeting inside your chip.
Hypothetically disregarding money, time, or commercial constraints, and with physics the only limitation, how would you design the ultimate computer processor? Would it be massively parallel, run at insanely high frequencies, and use exotic optical or quantum interconnects? Would it run familiar software, like an x86 or PowerPC, or have a new optimized instruction set? Would it be very large or very small? Would it require intelligent compilers or unique software constructs?
For many years, all designers had to do to make processors go faster was crank up their clock speed. That worked fine until power consumption and the associated heat dissipation caught up with the speed increases. Beyond that point, going faster meant doing something besides simply going faster.
Multicore goes faster, but …
Thus began the multicore era. If two heads are better than one, then four must be twice as good. To some extent, that axiom is true. But today’s dual- and quad-core processors aren’t running two to four times faster than the previous generation.
There are two reasons for this: hardware and software. The great majority of today’s multicore chips don’t scale very well, so four cores don’t really offer four times the performance of a single-core implementation. On-chip buses can’t keep up, cache coherence overhead eats performance, pipelines stall too frequently, and so on. For a variety of reasons, conventional microprocessor architectures don’t come close to doubling performance when their core count is doubled.
On the software side, many programmers aren’t comfortable or familiar with multicore programming. This is especially true when the multicore chip in question includes different types of processor cores (often called a heterogeneous architecture). Programming one processor is hard enough; programming four different ones with separate tool chains is far harder still.
Heterogeneous, homogeneous, or just humungous?
An argument can be made that different compute problems require different resources, and that microprocessors should therefore include a spectrum of different processing resources. For instance, some tasks might require signal-processing ability, others might require single instruction multiple data vector processing, while still others might involve complex decision trees and massive data movement.
One school of thought is that no one processor architecture can efficiently handle all these different tasks; thus, a mosaic of different architectures is required. In the extreme case, one can envision a processor made up of wildly different compute engines with nothing in common but the package they share. These processors are really cohabitating, not cooperating.
The opposite approach is to choose one instruction set and stick to it. This no doubt simplifies programming but runs the risk of deploying overly generic processors that aren’t fine-tuned to a particular task. On the other hand, processors are programmable, and it’s easier and cheaper to change software than hardware.
Ease of programming is not a trivial issue, either. Project delays are typically caused by software bugs, not hardware problems. Complicating things further, programmers are scared to death of multicore processors. Getting one high-end processor to work reliably is hard enough; how do you program and debug 10 of them? It’s easier to program, debug, and design with one core architecture instead of dealing with an amalgam of different cores with different instruction sets, architectures, buses, tools, and debug methods.
Intel and AMD have taken most of this advice to heart and produced dual- and quad-core versions of their legacy x86 architectures. In part, that’s simply making an asset out of a liability; x86 is what they know how to do, and backward compatibility is vitally important to their markets. Existing x86 code runs nicely on these upgraded designs, though it rarely runs much faster than before or makes significant use of the additional cores.
In contrast, many RISC CPU and network processor (NPU) vendors have taken a radically different approach, mixing an assortment of different processor cores and architectures into a variety of Swiss Army knife designs. IBM’s famous Cell processor (Figure 1), for example, has one general-purpose processor core plus eight specialty cores, requiring different tools and programming techniques. Several wide buses – some rings, some more conventional – connect the cores in various ways. Cell’s performance is impressive, but PlayStation programmers complain that Cell is a tough beast to tame, partly because managing bandwidth, latency, bus transactions, and coherence is all part of the game.
It’s one thing to corral all the right hardware resources onto a single chip; it’s quite another to make the combination usable. Massively parallel chips with a mixture of architectures combine the worst of both worlds: large-scale multicore complexity with disparate and distinct tools and architectures. It’s like holding a United Nations meeting inside your chip.
A better approach is to keep the massively parallel part, which is de rigueur for high performance, but ditch the differences and connect lots of the same processor core together in a two-dimensional mesh. In concept, it’s not much different from connecting individual computers over a network, just on a microscopic scale.
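To make the mesh idea concrete, here is a minimal sketch of how identical cores might be addressed in a two-dimensional grid. It assumes each core is named by a (row, column) pair and that the mesh has no wraparound links at the edges. The function names are made up for illustration and do not describe any vendor's actual interface.

```python
def neighbors(row, col, rows, cols):
    """Return the adjacent cores of (row, col) in a rows x cols mesh.

    Assumes no wraparound: cores on an edge simply have fewer neighbors.
    """
    candidates = {
        "north": (row - 1, col),
        "south": (row + 1, col),
        "west": (row, col - 1),
        "east": (row, col + 1),
    }
    return {
        direction: (r, c)
        for direction, (r, c) in candidates.items()
        if 0 <= r < rows and 0 <= c < cols
    }


def min_hops(src, dst):
    """Fewest link traversals between two cores: the Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


if __name__ == "__main__":
    print(neighbors(0, 0, 10, 10))    # a corner core has two neighbors
    print(neighbors(5, 5, 10, 10))    # an interior core has four
    print(min_hops((0, 0), (9, 9)))   # opposite corners of a 10x10 mesh
```

Every core follows the same two rules regardless of how large the grid is, which is much of the appeal.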
Meshing also has “grok-ability” on its side. It’s not hard for programmers to wrap their heads around the idea of 10, 100, or 1,000 identical processor cores working the same way and communicating with one another in a simple and mostly transparent way. Whether each of the 1,000 elements is perfectly tuned for a given job is almost irrelevant; what’s important is there are 1,000 processors to throw at a problem.
Such a homogeneous arrangement also aids scalability. While Cell-like combinations are well-suited to their specific tasks, building a larger or smaller version of Cell requires a significant amount of redesign work from the chip maker, and even more work from the programmer on the receiving end. Existing Cell code won’t magically scale up or down to a chip with a different mix of resources. It might not run at all. In contrast, adding 25 percent more processors to a mesh of identical processors adds 25 percent more computing power without breaking existing code.
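The scaling argument can be illustrated with a small sketch, assuming a workload that divides into independent items and ignoring communication and memory overhead (a large assumption in practice). The hypothetical partition function below takes the core count as a parameter, so the same code runs unchanged on a 16-core or a 20-core mesh.

```python
def partition(n_items, n_cores):
    """Split n_items into n_cores contiguous (start, stop) ranges.

    Range sizes differ by at most one item, so identical cores
    receive near-identical shares of the work.
    """
    base, extra = divmod(n_items, n_cores)
    ranges = []
    start = 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def largest_share(n_items, n_cores):
    """Size of the biggest per-core range, a rough proxy for run time."""
    return max(stop - start for start, stop in partition(n_items, n_cores))


if __name__ == "__main__":
    for cores in (16, 20):  # 20 is 25 percent more cores than 16
        print(cores, largest_share(1000, cores))
```

Going from 16 to 20 cores shrinks the largest per-core share of 1,000 items from 63 to 50 without touching the calling code. Whether that translates into a matching speedup depends on how well the interconnect and memory system keep up.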
That doesn’t mean designing this type of chip is trivial. Bandwidth between and among the cores is the first challenge. If the cores can’t talk to each other efficiently, there’s not much point in connecting them. An example of this approach is Tilera’s TILE-Gx100 processor (Figure 2), packed with 100 identical cores. In this processor, the bandwidth between adjacent cores is 1,100 Gbps. With four connections per core in the north/south/east/west directions, the 100-core processor has an aggregate bandwidth of 200 Tbps. Most applications would be hard-pressed to use a fraction of that. Even Tilera’s relatively modest Gx16 chip with a 4x4 array of cores boasts 20 Tbps of on-chip bandwidth.
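The aggregate numbers are easy to sanity-check with a little arithmetic. The sketch below treats the quoted aggregates as terabits per second, to match the per-link figure in Gbps. It also assumes a square mesh with no wraparound links and counts each link between two adjacent cores once. None of those accounting details are stated above, and the function names are invented for illustration.

```python
def mesh_links(rows, cols):
    """Count nearest-neighbor links in a rows x cols mesh (no wraparound)."""
    horizontal = rows * (cols - 1)  # east/west links in every row
    vertical = cols * (rows - 1)    # north/south links in every column
    return horizontal + vertical


def aggregate_tbps(rows, cols, link_gbps=1100.0):
    """Sum of all link bandwidths, converted from Gbps to Tbps."""
    return mesh_links(rows, cols) * link_gbps / 1000.0


if __name__ == "__main__":
    for side in (4, 10):
        print(side, mesh_links(side, side), aggregate_tbps(side, side))
```

Under these assumptions a 10x10 mesh has 180 links and about 198 Tbps in total, close to the quoted 200. A 4x4 mesh has 24 links and about 26 Tbps, somewhat above the quoted 20, so the vendor’s own accounting presumably differs in the details.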
Another challenge of this tile-based design is memory latency. If memory isn’t close or accessible enough, all those processors can grind to a halt. Here again, Tilera breaks up its device into easily replicated tiles, each with its own local L1 and L2 cache. Interestingly, even though the memory is local to each tile, it can also be part of a larger shared distributed cache that maintains coherence among all of the sharers. In some scenarios, programmers might want to define an arbitrary number of islands of cache coherence, either cooperating with or ignoring neighboring tiles as necessary.
The overall chip architecture is like a fabric of computing. Identical blocks of logic, memory, and interconnect are replicated in rows and columns to make larger or smaller chips. And like an FPGA or fractal Mandelbrot image, a tiled processor looks the same at any scale. Large or small, it’s programmed the same way. Scalability squared.
Like a quad-core x86 but unlike Cell or NVIDIA chips, the TILE-Gx mesh interconnect works transparently under the hood. Mesh traffic doesn’t need to be massaged manually, nor do transactions need to be hand-tuned to avoid conflicts or arbitration. As central as it is, the mesh is basically invisible, which is the way programmers like it.
Scalability ultimately wins
As with most ecosystems, many different kinds of processors will survive. Some will thrive, while others will barely eke out a living in some specific niche. Outside forces will cull the herd, as has happened with graphics and network processors, winnowing those that don’t fit the current environment.
For the past few decades, scalability and programmability have been the keys. Developers want a chip they can understand and stick with for successive generations. They want a roadmap for growth, both up and down the price/performance scale. And making it really, really fast doesn’t hurt, either.