Multicore for Portables
Multicore is the future of portable designs, but there are a number of routes to get there and no shortage of challenges along the way.
This is the Age of Multicore. After 40 years of surfing Moore’s Law to greater and greater levels of performance, a few years ago the semiconductor industry finally started to hit some brick walls thrown up by simple physics. When at 32 nm static power becomes a more difficult problem than dynamic power, it’s time to consider your options.
In 2005 Paul Otellini of Intel—long the leading proponent of “faster is better”—announced, “We are dedicating all of our future product development to multicore designs. We believe this is a key inflection point for the industry.” That train has now left the station with the rest of the industry on board.
Markus Levy, President of the Multicore Association, sees multicore as an inevitable migration path for portable designs. “In order to meet performance requirements with an acceptable battery life, multicore processors are the chosen approach by the silicon providers targeting the next generation of any compute intensive portable devices, for example handsets, netbooks, game consoles, and the like. They will have the most impact in areas handling large amounts of data, namely multimedia applications. Multimedia applications increase performance requirements in both the application itself and the radio/modem side (both dealing with streaming data). Multicore provides a wide range of scaling from low power to high performance, important in these kinds of devices.”
Multicore would certainly seem to make at least as much sense for portable devices as it does for mainframes, servers and notebooks. Since dynamic power consumption increases directly with frequency and with the square of the voltage, running multiple cores at reduced frequency and voltage would seem to be a natural fit: the physics of dynamic power lets you sidestep Moore’s Law. Multicore devices deliver highly scalable performance and low power—what’s not to like?
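The arithmetic behind that claim is the standard CMOS dynamic-power relation, P = C·V²·f. The capacitance and voltage values in this Python sketch are illustrative assumptions, not figures for any real device:

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic CMOS power: P = C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency

# One core at full frequency and nominal voltage...
single_core = dynamic_power(1.0, 1.2, 1.0)
# ...versus two cores at half frequency, which also permits a lower voltage.
dual_core = 2 * dynamic_power(1.0, 0.9, 0.5)

print(round(single_core, 2))  # 1.44
print(round(dual_core, 2))    # 0.81 -- comparable throughput at ~56% of the power
```

The win comes almost entirely from the voltage term: halving frequency alone would only break even, but the lower clock permits a lower supply voltage, and power falls with its square.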
In researching this article I asked a number of semiconductor and software industry luminaries the same question: “What are the major impediments standing between us and this rosy scenario?” Gary Smith of Gary Smith EDA nailed it up front: “Amdahl’s Law.”
Moore’s Law Meets Amdahl’s Law
Amdahl’s law expresses the law of diminishing returns: the incremental improvement in speedup gained by an improvement of just a portion of a computation diminishes as improvements are added. Stated more formally:
Parallel speedup = 1/(Serial% + (1-Serial%) / Number of processors)
Here Serial% is the percentage of work that must be done serially, and (1-Serial%) is the percentage of work that can be done in parallel. The bottom line is that the advantages of multi-core parallel processing are only available if your code can actually be processed in parallel. To the extent that it can’t, the advantages of multicore quickly start to disappear.
For example, if only 5% of your code involves serial processing, the performance speedup is relatively linear as the number of processors increases (Figure 1). But with 16 processors, the maximum speedup achieved is still only 9x over a single processor. As the percentage of code serialization increases, the maximum speedup as well as the rate of performance gained is reduced significantly. At 50% serialization, the performance gain beyond four processors is almost insignificant.
Figure 1: The effects of Amdahl’s law on applications with varying degrees of serialization and number of processors
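The numbers cited above fall straight out of the formula; this short Python sketch (mine, not from the article) reproduces them:

```python
def amdahl_speedup(serial_fraction, processors):
    """Parallel speedup predicted by Amdahl's law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# 5% serial code: 16 processors yield only about a 9x speedup.
print(round(amdahl_speedup(0.05, 16), 2))  # 9.14
# 50% serial code: going from 4 to 16 processors gains very little.
print(round(amdahl_speedup(0.50, 4), 2))   # 1.6
print(round(amdahl_speedup(0.50, 16), 2))  # 1.88
```

Note the asymptote: as the processor count grows without bound, speedup can never exceed 1/Serial%, so 50% serial code tops out at 2x no matter how many cores you add.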
This raises a number of interesting questions: In what applications does multicore processing make the most sense? How can I write code that's well adapted to multicore processing? And is it worth it to try to port my legacy application to a multicore processor?
Types of Parallelism
In their classic book Computer Architecture, Hennessy and Patterson1 recognize four basic types of processors:
- Single instruction stream, single data stream (SISD)—In other words, uniprocessors, the basis of most embedded applications.
- Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism by applying the same operations to multiple items of data in parallel. Each processor has its own data memory but a shared instruction memory and a single, shared control processor. Multicore graphics processors usually have SIMD architectures.
- Multiple instruction streams, single data streams (MISD)—Still a science project.
- Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own instructions and operates on its own data. MIMD computers exploit thread-level parallelism, since multiple threads operate in parallel. MIMD processors are less specialized than SIMD architectures and are therefore more widely used.
Microprocessors have long used pipelines to overlap the execution of instructions to improve performance. The potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel. Designers can either rely on their compilers to find parallelism statically at compile time or, more commonly, they can rely on processor hardware to help discover and exploit parallelism in their code dynamically. Compilers and often processors can convert loop-level parallelism into ILP.
No matter how long your pipeline or how fast you clock a processor, there are limits to ILP. Perfect exploitation of ILP would require an infinite number of virtual registers, perfect branch and jump prediction, exactly known memory addresses, and memory accesses that take only one clock cycle. Since this perfect world doesn’t exist, ILP can be quite limited or hard to exploit in many applications.
Working at a higher level, thread-level parallelism (TLP) exploits multiple threads of execution that are inherently parallel, where each thread is a separate process with its own instructions and data. For example, a central financial or scientific database may process multiple queries and updates in parallel. Such applications naturally lend themselves to parallel processing.
As with ILP, there are limits to how far you can take TLP. When a number of threads try to simultaneously access shared memory or an I/O port, the processor resorts to spin locks to arbitrate access to that resource. The more threads compete for that resource, the more overall performance slows—potentially to the point of deadlock.
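The contention problem is easy to demonstrate. In this illustrative Python sketch (not tied to any particular processor or RTOS), eight threads all funnel through one lock, so the "parallel" loop is effectively serialized even though the final count comes out right:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    """Increment a shared counter; the lock keeps the result correct."""
    global counter
    for _ in range(iterations):
        with lock:      # every thread serializes at this point
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 80000 -- correct, but the shared lock throttled the parallelism
```

Amdahl's law applies directly here: the critical section inside the lock is the serial fraction, and adding threads past a certain point buys nothing.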
Things get interesting when you combine ILP with TLP in simultaneous multithreading (SMT). SMT exploits TLP and ILP at once, issuing instructions from multiple threads into the same issue slots in a single clock cycle. This enables a superscalar processor to achieve more functional parallelism than a single thread can effectively use.
Most new multicore designs are putting less emphasis on ILP, focusing instead on thread-level parallelism, where a multicore approach holds the most promise. So how do you implement these various types of parallelism in hardware?
There are two basic approaches to multiprocessing—asymmetric and symmetric. Systems that incorporate different types of processors are called asymmetric multiprocessing (AMP) systems. Systems with multiple identical CPUs are called symmetric multiprocessing (SMP) systems.
Simple AMP processors are already common in portable designs. TI’s OMAP and ADI’s Blackfin both combine a general purpose processor and a DSP in a single chip. Loops are offloaded to the DSP for fast parallel processing—usually graphics applications, with lots of simple, repetitive calculations—while the CPU handles control and communications. Cell phone SoCs incorporating multiple ARM cores are also common, with one core acting as the CPU and a different core as a graphics coprocessor.
DSP vendors don’t want to be seen as bit players in a multicore architecture. According to Eran Briman, VP of Corporate Marketing at CEVA, “We have cases where our licensees use two DSPs apart from the ARM CPU. They either take two DSPs and partition the baseband functions between them; or sometimes they have two DSPs, one doing the baseband and the other doing voice and audio processing.” Why two DSPs? “The nice thing about such an architecture is definitely power consumption. If you are aiming at some kind of MP3 use case running on your mobile handset, you want to be able to run 30 or 40 hours on a single battery charge. That's going to mean you need to shut down all of your on-chip resources and leave only the audio DSP to do the processing.”
Programs generally map fairly easily to AMP architectures, since program modules or, on a more granular level, specific functions can be assigned to a series of specially designed application processors or cores. The remaining problem then comes down to communication between the cores, which is typically done over buses under the control of the CPU.
With SMP architectures things get more complicated. Here there are two basic architectures. With a centralized shared memory multiprocessor (Figure 2), each processor shares the same memory and I/O system via one or more buses. While the architecture is straightforward, bus bandwidth, bus contention and cache coherence quickly become serious issues when more than a few cores are involved. Careful design of both hardware and software is required to prevent race conditions, where two processors try to write to the same block of memory at the same time, as well as data races, where variables may be updated without any ordering imposed by synchronization.
Figure 2: Centralized shared-memory multiprocessor architecture
The alternative is a distributed memory multiprocessor architecture (Figure 3), in which each processor has its own memory and I/O; all processor nodes share a central interconnection network. Semiconductor vendors vary widely in how they structure interconnects and handle message passing between the various processor cores.
Figure 3: Distributed-memory multiprocessor architecture
Being the dominant processor player in handsets, ARM is strongly supporting multicore—answering the question of whether multicore has a future in portable designs. Actually, ARM has been licensing its ARM11 MPCore—which it bills as the “first integrated multiprocessor core”—since May 2004. ARM currently offers its “second generation” ARM11 MPCore synthesizable processor as well as the new ARM Cortex-A9 MPCore, which is targeted at high-performance mobile handsets.
Cortex-A9 MPCore processors (Figure 4) utilize a dynamic-length, 8-stage superscalar, multi-issue pipeline with speculative out-of-order execution. Each core is capable of executing up to four instructions per cycle in devices clocked at more than 1 GHz. The basic configuration is an SMP distributed memory architecture. Each processor can be independently configured for its cache sizes and for optional support of a floating-point unit (FPU), Media Processing Engine (MPE) or Program Trace Macrocell (PTM).
Figure 4: ARM Cortex-A9 MPCore Architecture
The Snoop Control Unit (SCU) is responsible for managing the interconnect, arbitration, communication, cache-to-cache and system memory transfers, cache coherence and other multicore capabilities for all MPCore-enabled processors.
The Accelerator Coherence Port is an AMBA 3 AXI-compatible slave interface on the SCU (Figure 5). It provides an interconnect point into the processor’s cache hierarchy for system masters that, for reasons of overall system performance, power consumption or software simplification, are better interfaced directly with the Cortex-A9 MPCore processor. The interface acts as a standard AMBA 3 AXI slave and supports all standard read and write transactions without placing additional coherence requirements on attached components.
Figure 5: ARM Cortex-A9 MPCore Accelerator Coherence Port
The Generic Interrupt Controller handles inter-processor communication and the routing and prioritization of system interrupts. It supports up to 224 independent interrupts; under software control, each interrupt can be distributed across CPUs, prioritized in hardware, and routed between the operating system and the TrustZone software management layer. This routing flexibility, together with support for virtualizing interrupts into the operating system, is one of the key features needed by solutions built on a paravirtualization manager.
Configurable with either a single or a dual 64-bit AMBA 3 AXI master interface, the processor can load-balance transactions into the system interconnect at CPU speed, at rates capable of exceeding 12 GB/s.
On-Chip Packet Processing
One company taking an interesting approach to the multicore programming problem is XMOS. Their XS1-G silicon products include one-, two- and four-core “software defined silicon-programmable devices,” each core with its own memory and I/O system, with direct support for concurrent processing (multi-threading). A high-performance switch fabric supports communication between the processors, and inter-chip buses—a channel-based messaging mechanism called XLinks—are provided so that systems can easily be constructed from multiple chips. The architecture is both multithreaded and event-driven. Any thread can communicate with any other thread in the system using single-cycle communication instructions. Threads can be used to define independent tasks; the event mechanism enables fast and controlled responses to a multitude of signals.
XMOS’ approach makes life easier for programmers. According to Richard Terrill, EVP of Marketing at XMOS, “The bus is a mix of software and hardware. Let’s say you have two functions running different threads and they need to communicate—say thread one needs to pass a validated packet to thread two. If you were doing that in an ASIC or FPGA, you'd have to create a bus, parameterize it, sequester off resources, and—to guarantee timing, performance, and the like—put in all sorts of extra overhead. With us you declare a variable called a channel and the compiler sorts it out. The switch processor on the chip is basically a routing table. The programmer simply asserts a value or overwrites a value onto a variable; if it’s a variable it’s stored in memory and if it's a channel it gets pushed out to the switch fabric. Programmers don't have to worry about the hardware details, they just have to make function calls.”
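XMOS expresses this channel idiom in its XC language; as a rough analogue, a Python queue can play the channel's role, letting one thread hand validated packets to another with no explicit bus design. The packet strings here are invented for illustration:

```python
import queue
import threading

# The queue plays the role of an XMOS channel: the producer asserts values
# onto it and the consumer receives them, with no bus plumbing in between.
channel = queue.Queue()

def validate_packets():
    """Thread one: pass validated packets into the channel."""
    for packet in ["hdr|payload|crc", "hdr|data|crc"]:
        channel.put(packet)
    channel.put(None)              # sentinel: no more packets

def process_packets(results):
    """Thread two: block on the channel until data arrives, then process it."""
    while True:
        packet = channel.get()
        if packet is None:
            break
        results.append(packet.upper())

received = []
t1 = threading.Thread(target=validate_packets)
t2 = threading.Thread(target=process_packets, args=(received,))
t1.start(); t2.start()
t1.join(); t2.join()
print(received)
```

The design point Terrill describes is the same: the programmer declares the channel and reads or writes it like a variable, and the runtime (here the queue, on XMOS silicon the switch fabric) handles delivery.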
Another way to address the programming problem is through virtual platforms. According to Simon Davidmann, President and CEO of Imperas Ltd., “More and more people are having to develop more and more software in their products, and they don't want to build prototypes, because that takes too long and involves too much effort, and the prototypes are too unreliable and too expensive. They can't wait for the chips to come back, so they need to work on simulation. In the software world they call that a virtual platform.”
“To effectively develop software for multi-core devices you need ultra-fast models of processors, peripherals, behavioral components, and platforms that can be put together into fast virtual platform and virtual prototype simulations that will run in the 500 MIPS range.” Davidmann defines “three different types of simulation. At the lowest level, hardware virtual platforms tend to be timing cycle accurate. At the top level is virtualization for operating systems, middleware, and stuff like that. In the middle you have software virtual platforms, which are instruction accurate and useful for application development.”
Imperas focuses on software virtual simulation and prototyping. It donated its virtual prototyping software and tools to the Open Virtual Platforms (OVP) initiative, which it founded. The OVP web site currently hosts 82 high-speed models of ARM, MIPS, ARC, and OR1K processors in homogeneous, heterogeneous, single core, multicore and manycore configurations; models are available in C, C++, SystemC and TLM 2.0 forms. Its OVPsim simulator is also available as a free download.
Get With the Program
As XMOS’ Terrill indicated, the real challenge in multicore is programming. As Dave Stewart, CEO at CriticalBlue, points out, “When you've written a large amount of sequential code, sometimes you find that the parallel transition is actually more complex than you expected.” Even if you try to create software blocks that will run independently, data dependencies may require serial execution, and latencies due to core-to-core communication can lead to slowdowns or out-of-order execution, which can result in function failures.
CriticalBlue sets out to reduce the multicore design effort by insulating the programmer from the hardware details. Their product Cascade generates an optimized coprocessor in synthesizable RTL with synthesis scripts, an instruction- and bit-accurate C functional model, and a testbench that verifies the implementation with the same stimuli and expected responses as those of the CPU, ensuring functional equivalence. The implementation then proceeds through the designer's own system-on-chip (SoC), FPGA or structured ASIC design and verification flows.
CriticalBlue is also trying to address the problem of migrating legacy code to modern multicore platforms. “We have examples of people building silicon with hundreds of cores,” he said recently. “I don't think building silicon with multiple cores is the difficult part here. The difficult part is during software migration to multicore making sure that you don't screw things up when you do it.”
Glenn Perry, General Manager for ESL/HDL Design at Mentor Graphics, agrees with Stewart that software is the hard problem with multicore design. “Software developers who have historically relied on certain techniques to ensure that they have thread safety are going to face a whole new level of challenges with multicore, and that's going to require them to either understand a lot more about the details of the system than they previously needed to, or it's going to require some tools that can give them a lot more assistance in the process.”
The Tools Challenge
“One of the tools needed,” continued Perry, “is a more comprehensive hardware/software debug platform that really understands and can simulate code in a multicore environment to let you catch those race conditions and locks that are very hard to diagnose otherwise”—a process Perry refers to as “chasing ‘Heisenbugs’ in a multicore environment.” Since the use of multicore in portable designs is largely about power, Perry says Mentor is reassessing all its tools to better support multicore architectures. In particular their Questa platform has focused on fast power-aware simulation and verification in complex multicore environments.
I asked The Multicore Association’s Levy about the state of EDA tools for multicore designs. “Debug tools, at least for homogeneous multicore architectures, seem to be in reasonably good shape,” he claims. “Design flows and assistance for programming multicore architectures on the other hand, particularly when it comes to migrating existing sequential code to parallel, are virtually non-existent.”
“New approaches have been proposed,” continued Levy, “but they either don't take account of existing software development environments or require significant code refactoring, often into proprietary flows. There are some interesting companies, such as CriticalBlue, that have added multicore software capabilities to their design flows; they have been inspired by the idea that they must work with existing software and existing software development environments. That said, critical areas that still need improvement include tools for analysis, application ‘manipulation’, programming, configuration and debugging as well as more multicore specific runtime software and standards.” The Multicore Association is hard at work developing such standards.
The OS Response
When programming for general-purpose computers—from mainframes to notebook PCs—programmers routinely rely on the operating system (OS) to transparently handle software complexities for them. But to what extent can the small-footprint RTOSs in battery-powered, handheld devices assist applications that are complex enough to require a multicore solution?
According to Wind River, quite a bit. The company recently introduced VxWorks version 6.7, which adds support for multicore asymmetrical multiprocessing (AMP) to its flagship RTOS, which already supports SMP. Marc Brown, vice president of VxWorks marketing at Wind River, claims VxWorks supports true concurrent execution of tasks and interrupts; it contains a priority-based preemptive scheduler that manages the concurrent execution of tasks and automatic load balancing on different CPUs; and it has specialized mechanisms for precise synchronization between tasks and interrupts received simultaneously on different CPUs. All of this enables development of SMP applications to begin without physical hardware.
Wind River configures their OS differently for AMP and SMP configurations. In an AMP configuration (Figure 6a), different instantiations of the operating system run on each processor or core. Each processor/OS combination is really a single-processor computer in its own right. In an SMP configuration (Figure 6b) one operating system controls more than one identical processor or core. Applications interact with only one operating system, just as they do in single-processor systems. The fact that there are several processors in the system is a detail that the OS hides from the user.
Figure 6a: An asymmetric multiprocessing OS configuration
Figure 6b: A symmetric multiprocessing OS configuration
VxWorks also attempts to hide the details of inter-core communication from the programmer. In the case of SMP systems, this isn’t too difficult. Since all the processors are identical, an SMP operating system is able to dispatch any work to any processor in the system. As a result of hardware abstraction and load balancing in the system, the SMP OS simplifies the task of developing software to run on SMP hardware.
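The "any work to any processor" model maps naturally onto a process pool. This Python sketch is only an analogy for what an SMP OS does transparently, not VxWorks code; the workload function is invented for illustration:

```python
from multiprocessing import Pool

def crunch(n):
    """A CPU-bound task the OS scheduler may place on any available core."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Four identical workers stand in for four SMP cores; the pool
    # load-balances the eight tasks across them automatically.
    with Pool(processes=4) as pool:
        results = pool.map(crunch, [10_000] * 8)
    print(len(results))  # 8
```

The caller never says which worker runs which task, just as an application on an SMP OS never says which core runs which thread; that hiding of placement decisions is exactly the abstraction the paragraph above describes.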
The OS faces a harder task with AMP systems. Here each core hosts a separate OS. The cores are not only physically different, they may be hosting different OSs—say VxWorks, Linux and Windows Embedded—or just a portion of an OS. Not only will each processor be handling different tasks, but the different OSs need to be able to communicate with each other and coordinate with the overall system, whether through a bank of shared memory, different buses or a network, whether on-chip or off. It’s little wonder that OS support for AMP systems is so late coming to market.
Ultimately there’s clearly a limit to how far the OS can abstract away all the details and toil of writing code for a multicore system. That’s where software virtualization comes in.
We’re Virtually There
Virtualization technology enables multiple operating systems to run simultaneously on the same single-core or multi-core processor. The guest OSs are independent from each other, but can cooperate via various communication mechanisms.
The basis for any virtualization scheme is the hypervisor, which resides either in hardware or software and serves as an abstraction layer between the processor and the OS. Hypervisors play a critical role in enabling complex OSs such as Linux and Windows to work on the wide range of processors found in embedded applications while preserving close-to-real-time behavior. Hypervisor vendors such as Trango, Real-Time Systems, Wind River and VirtualLogix all take somewhat different approaches to the problem.
So how do hypervisors work? According to Fadi Nasser, Director of Product Management for VirtualLogix, “Suppose you have shared memory in an SMP architecture. How do these cores share memory? This is all controlled by the hypervisor. It's in the virtualization layer that we manage processes and shared memory, and we manage access to them to prevent collisions. It's not just about performance. We provide the infrastructure which enables [different OSs] to communicate with one another unaware that they are actually using shared memory at the bottom; they think they're using sockets to communicate.”
CriticalBlue’s Stewart is skeptical about virtualization in portable designs. “Virtualization and hypervisors are heavily used in the server space, but there is overhead involved in introducing that. The problem with the portable space and the deeply embedded area is that you’ve got to maintain the performance, and the overhead of putting virtualization in place in a deeply embedded environment is too high.”
Nasser strongly disagrees. “You need to distinguish between server- and enterprise-class virtualization solutions and embedded virtualization solutions. In the embedded space we have to ensure that performance is maintained and not sacrificed because we introduced this thin layer called the hypervisor. You also have to be sure you don't sacrifice memory footprint. There is definitely a cost, but the cost in terms of CPU utilization is in the range of 3%; in terms of I/O throughput it’s in the range of 1-4%; and in some cases we've established near-native I/O performance. So while there is a small cost, these solutions are built from the ground up with emphasis on performance.”
The Road Ahead
While arguments over architectures and implementations will continue to evolve, one thing is increasingly clear—multicore architectures are the future of portable designs. As for the roadmap, I’ll let Gary Smith have the last word: “Multicore for portable? I think everybody's accepted that already. We just have to figure out how to get there.”
1 John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach. San Francisco: Morgan Kaufmann Publishers, 2007.
ARM Inc., Sunnyvale, CA (408) 734 5600 www.arm.com
CEVA, Inc., San Jose, CA (408) 514 2900 www.ceva-dsp.com
CriticalBlue, San Jose, CA (408) 573 3609 www.criticalblue.com
The Multicore Association, El Dorado Hills, CA (530) 672-9113 www.multicore-association.org
Gary Smith EDA, Santa Clara, CA (408) 985-2929 www.garysmitheda.com
Imperas, Ltd., Thames, Oxfordshire, U.K. +44 1844 217114 www.imperas.com
Mentor Graphics Corporation Wilsonville, OR (503) 685-7000 www.mentor.com
VirtualLogix, Sunnyvale, CA (408) 636-2804 www.virtuallogix.com
XMOS Semiconductor, Bristol, UK +44 (0)117 915 1271 www.xmos.com
This article first appeared in the March/April issue of Portable Design. Reprinted with permission.